ebook-ingest

Use this skill when the user wants to find, download, and prepare an ebook for AI agent ingestion (RAG, fine-tuning, or long-context reference). Triggers include: requests to acquire a digital copy of a book the user owns in print, building a personal book corpus for an AI agent, converting EPUB/PDF/MOBI to clean Markdown for LLM consumption, or chunking books for vector stores. Handles search across multiple sources (Gutenberg, Standard Ebooks, Anna's Archive, LibGen, Z-Library, archive.org), format conversion via calibre/pandoc, OCR for scanned PDFs, cleanup, metadata, and chunking. Do NOT use for academic papers (use Sci-Hub/unpaywall), bulk public-domain scraping (hit Gutenberg's API directly), or DRM'd commercial ebooks the user has not purchased.
tomcounsell/ai · ★ 14 · AI & Automation · score 70
Install: claude install-skill tomcounsell/ai
# Ebook acquisition and AI ingestion prep

## Overview

End-to-end pipeline for turning a named book into clean, structured Markdown ready for an AI agent to consume. Covers search → download → convert → clean → chunk.

Assumes the user owns a print copy and is creating a personal digital backup for private AI use. Skip this skill if that premise doesn't hold.

## Quick reference

| Step | Tool | Output |
|------|------|--------|
| Search | Anna's Archive (meta), Gutenberg, Standard Ebooks | candidate file URLs |
| Download | `curl` / `wget` | `library/raw/<slug>.<ext>` |
| Convert | `pandoc` (EPUB), `pdftotext -layout` (PDF), `ocrmypdf` (scanned) | raw `.md` or `.txt` |
| Clean | `clean_book.py` | normalized Markdown |
| Metadata | YAML frontmatter | `<slug>.md` |
| Chunk (optional) | `langchain` text splitters | `chunks/*.json` |

## Prerequisites

```bash
# macOS
brew install calibre pandoc poppler tesseract ocrmypdf

# Ubuntu/Debian
apt-get install calibre pandoc poppler-utils tesseract-ocr ocrmypdf

# Python
pip install ebooklib beautifulsoup4 markdownify pymupdf langchain-text-splitters httpx
```

## Configuration

Anna's Archive supports authenticated programmatic downloads for paid members. Two env vars are required:

```bash
# In ~/Desktop/Valor/.env (already set on this machine)
ANNAS_ARCHIVE_ACCOUNT_ID="<account-id>"
ANNAS_ARCHIVE_SECRET_KEY="<secret-key>"
```

Both values are available at `https://annas-archive.org/account` after donating. They are passed as query