ocr-and-documents · listed

Extract text from PDFs/scans (pymupdf, marker-pdf).
aashutosh396/mindpalace · ★ 0 · AI & Automation · score 78

Install: claude install-skill aashutosh396/mindpalace

# PDF & Document Extraction

For DOCX: use `python-docx` (parses actual document structure, far better than OCR). For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support). This skill covers **PDFs and scanned documents**.

## Step 1: Remote URL Available?

If the document has a URL, download it first, then extract locally:

```bash
curl -fsSL -o /tmp/doc.pdf "https://arxiv.org/pdf/2402.03300"
curl -fsSL -o /tmp/report.pdf "https://example.com/report.pdf"
```

Then run pymupdf/marker on the downloaded file (steps below). If your client has a built-in web-fetch / URL-to-markdown tool, you can use that for a quick text grab instead, but `curl` + pymupdf works everywhere with no extra deps.

## Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---------|-----------------|---------------------|
| **Text-based PDF** | ✅ | ✅ |
| **Scanned PDF (OCR)** | ❌ | ✅ (90+ languages) |
| **Tables** | ✅ (basic) | ✅ (high accuracy) |
| **Equations / LaTeX** | ❌ | ✅ |
| **Code blocks** | ❌ | ✅ |
| **Forms** | ❌ | ✅ |
| **Headers/footers removal** | ❌ | ✅ |
| **Reading order detection** | ❌ | ✅ |
| **Images extraction** | ✅ (embedded) | ✅ (with context) |
| **Images → text (OCR)** | ❌ | ✅ |
| **EPUB** | ✅ | ✅ |
| **Markdown output** | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| **Install size** | ~25MB | ~3-5GB (PyTorch + models) |
| **Speed** | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

**Decision**: Use pymupdf un