ocr-and-documents

Extract text from PDFs and scanned documents. Use web_extract for remote URLs, pymupdf for local text-based PDFs, marker-pdf for OCR/scanned docs. For DOCX use python-docx, for PPTX see the powerpoint skill.

Data & Documents | 175,435 stars | 29,875 forks | Updated today | MIT

Quality Score: 96/100

Stars (weight 20%): 100
Recency (weight 20%): 100
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# PDF & Document Extraction

For DOCX: use `python-docx` (parses actual document structure, far better than OCR). For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support). This skill covers **PDFs and scanned documents**.

## Step 1: Remote URL Available?

If the document has a URL, **always try `web_extract` first**:

```
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
```

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, web_extract fails, or you need batch processing.

## Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---------|-----------------|---------------------|
| **Text-based PDF** | ✅ | ✅ |
| **Scanned PDF (OCR)** | ❌ | ✅ (90+ languages) |
| **Tables** | ✅ (basic) | ✅ (high accuracy) |
| **Equations / LaTeX** | ❌ | ✅ |
| **Code blocks** | ❌ | ✅ |
| **Forms** | ❌ | ✅ |
| **Headers/footers removal** | ❌ | ✅ |
| **Reading order detection** | ❌ | ✅ |
| **Images extraction** | ✅ (embedded) | ✅ (with context) |
| **Images → text (OCR)** | ❌ | ✅ |
| **EPUB** | ✅ | ✅ |
| **Markdown output** | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| **Install size** | ~25MB | ~3-5GB (PyTorch + models) |
| **Speed** | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

**Decision**: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis. If the use...
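The routing in Steps 1 and 2 can be sketched as a small pure-Python helper. This is only an illustration of the decision rule stated above, not code from the skill itself: the names `Needs` and `pick_tool` and the flags `web_extract_failed` and `batch` are hypothetical, and the function returns the name of the tool to try rather than calling `web_extract`, pymupdf, or marker-pdf.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class Needs:
    """Hypothetical flags for features that only marker-pdf covers in the table."""
    ocr: bool = False             # scanned pages or images containing text
    equations: bool = False       # equations / LaTeX recovery
    forms: bool = False           # form extraction
    complex_layout: bool = False  # reading order, header/footer removal


def is_remote(source: str) -> bool:
    """Return True when the source looks like an http(s) URL."""
    return urlparse(source).scheme in ("http", "https")


def pick_tool(
    source: str,
    needs: Optional[Needs] = None,
    web_extract_failed: bool = False,
    batch: bool = False,
) -> str:
    """Return the name of the extractor to try first (illustrative only)."""
    needs = needs or Needs()

    # Step 1: a URL goes to web_extract unless it already failed
    # or the job is a batch run.
    if is_remote(source) and not web_extract_failed and not batch:
        return "web_extract"

    # Step 2: reach for the heavy extractor only when a feature requires it.
    if needs.ocr or needs.equations or needs.forms or needs.complex_layout:
        return "marker-pdf"
    return "pymupdf"


if __name__ == "__main__":
    print(pick_tool("https://arxiv.org/pdf/2402.03300"))
    print(pick_tool("report.pdf"))
    print(pick_tool("scan.pdf", Needs(ocr=True)))
```

Keeping the choice in one function makes the default explicit: the lightweight path is taken unless a caller states a need that the table assigns to marker-pdf alone.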

Details

Author: NousResearch
Repository: NousResearch/hermes-agent
Created: 10 months ago
Last Updated: today
Language: Python
License: MIT
