pdf-extract

Extract and clean PDF content to markdown format. Use when the user uploads a PDF file and wants to convert it to clean, readable markdown. Handles text extraction, image extraction, metadata capture, and intelligent content cleanup. Removes repeated footers, watermarks, page numbers, branding, and reorganizes fragmented content into coherent structure.
maaarcooo/claude-skills · ★ 6 · Data & Documents · score 68

Install: claude install-skill maaarcooo/claude-skills

# PDF Content Extraction Skill

Extract PDF content to clean, organized markdown.

## Workflow

1. **Extract** — Run script to get raw content + metadata
2. **Analyse** — Review for patterns and issues
3. **Clean** — **Manually** remove noise (footers, watermarks, branding)
4. **Organise** — Restructure fragmented content
5. **Output** — Deliver clean markdown

> **Note:** Only Step 1 uses a script. Steps 2–5 are performed manually by Claude reading and rewriting content. Do not write cleanup scripts.

## Step 1: Extract

```bash
python /mnt/skills/user/pdf-extract/scripts/extract_pdf.py \
  /mnt/user-data/uploads/{filename}.pdf \
  /home/claude/extracted/
```

**Options:**

| Option | Description |
|--------|-------------|
| `--pages 1-10` | Extract specific page range |
| `--method pymupdf4llm` | Force primary extractor (better formatting) |
| `--method pymupdf` | Force fallback (more reliable for scanned PDFs) |
| `--min-image-size 100` | Skip images smaller than 100px (filters icons) |

**Output:**

```
/home/claude/extracted/
├── {filename}.md    # Raw markdown with YAML frontmatter
├── metadata.json    # Structured metadata
└── images/          # Extracted images (if any)
```

## Step 2: Analyse

Read the extracted markdown:

```bash
cat /home/claude/extracted/{filename}.md
```

**Check YAML frontmatter for:**

- `extraction_method` — Which extractor was used
- `total_pages` — Document length
- `has_outline` — Bookmarks exist (helps with structure)
- `total_images