parsing-documents

Solid

Extract structured data from PDF documents — text, tables, forms, and metadata. Use when reading or extracting content from a `.pdf` file, parsing invoices/reports/scanned documents, or converting PDF data to JSON/CSV. NOT for generating PDFs, and NOT for plain-text/markdown files (read those directly).

Data & Documents · 33 stars · 5 forks · Updated 1 week ago · MIT

Quality Score: 87/100

Stars (20%): 51
Recency (20%): 90
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Parsing Documents (PDF Extraction)

Extract structured information from PDF documents. Try the cheapest reliable method first.

## Parsing Strategy (Priority Order)

### 1. Native Reading (when supported)

Use the runtime's native PDF reader only when it explicitly supports PDF inputs. If the current reader supports only text/images, skip to CLI tools or page-image conversion instead of claiming the PDF was read.

```
Read → /path/to/document.pdf  # only on runtimes with PDF support
```

### 2. CLI Tools (quick operations)

```bash
pdftotext document.pdf output.txt    # Extract text
pdftotext -layout document.pdf -     # Preserve layout
pdfinfo document.pdf                 # PDF metadata
pdftoppm -png document.pdf output    # Convert pages to images
```

### 3. Python Libraries (complex extraction)

`pdfplumber`, best for tables:

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        text = page.extract_text()
```

`pypdf`, for metadata and forms:

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
metadata = reader.metadata
fields = reader.get_form_text_fields()
```

## Common Extraction Patterns

### Tables

```python
def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, 1):
            for table in page.extract_tables():
                tables.append({"page"...
```
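Once tables have been extracted, the JSON/CSV conversion mentioned in the description could look roughly like the minimal sketch below. It assumes each table is a list of rows whose first row is the header, with empty cells reported as `None` (the shape `pdfplumber`'s `extract_tables()` returns). The helper names `table_to_records` and `records_to_csv` are hypothetical and are not part of the skill or of `pdfplumber`.

```python
import csv
import io
import json


def table_to_records(table):
    """Turn one table (list of rows, first row = header) into a list of dicts.

    Hypothetical helper: assumes header names are unique; duplicate names
    would overwrite each other in the resulting dicts.
    """
    if not table:
        return []
    # Fall back to a positional name when a header cell is empty or None.
    header = [(cell or "").strip() or f"col_{i}" for i, cell in enumerate(table[0])]
    records = []
    for row in table[1:]:
        # Normalise None cells to empty strings.
        cells = [(cell or "").strip() for cell in row]
        # Pad short rows so every record has the same keys;
        # zip() silently drops cells beyond the header width.
        cells += [""] * (len(header) - len(cells))
        records.append(dict(zip(header, cells)))
    return records


def records_to_csv(records):
    """Serialise a list of dicts to CSV text (hypothetical helper)."""
    if not records:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def records_to_json(records):
    """Serialise a list of dicts to pretty-printed JSON text."""
    return json.dumps(records, indent=2, ensure_ascii=False)
```

For example, passing `[["Item", "Qty"], ["Widget", "2"], ["Gadget", None]]` to `table_to_records` yields two records keyed by `Item` and `Qty`, which can then be handed to `records_to_csv` or `records_to_json`. Keeping the conversion separate from the PDF reading step means it can be checked without a sample PDF on hand.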

Details

Author: alexei-led
Repository: alexei-led/cc-thingz
Created: 11 months ago
Last Updated: 1 week ago
Language: Python
License: MIT
