web-scraping · listed

Clean LLM-ready web scraping via Firecrawl (scrape/crawl/map/extract/search). Trigger when the user wants to extract content from a page, crawl a site, collect structured data, bypass anti-bot/JS-rendering, or perform a web search with integrated extraction. Fall back to Playwright/curl if Firecrawl is unavailable.
christopherlouet/claude-base · ★ 4 · AI & Automation · score 80
Install: claude install-skill christopherlouet/claude-base
# Web Scraping (Firecrawl-first)

## Goal

Extract LLM-ready web content without hacking around: clean markdown, structured JSON, anti-bot and JS-rendering handled. Firecrawl is the reference wrapper; fall back to Playwright or `curl + html2text` if unavailable.

## When to trigger this skill

- "scrape this page / this site"
- "extract data from ..."
- "crawl site X"
- "fetch all articles from ..."
- "search the web and extract the content"
- "parse this dynamic page" (site with JS-rendering)
- "bypass the paywall / anti-bot" (legitimate use only)

## When NOT to use this skill

- Quick web search without structured extraction -> `WebSearch` is enough
- A single static URL, simple page -> `WebFetch` is enough
- Visual test / browser interaction -> skill `qa-chrome` or agent-browser
- Form / login automation -> agent-browser or Playwright directly

## Prerequisites

### Option 1: Firecrawl cloud (recommended)

```bash
export FIRECRAWL_API_KEY="fc-xxx"  # https://firecrawl.dev
npm install -g firecrawl           # or pip install firecrawl-py
```

### Option 2: Firecrawl self-hosted

Docker compose available on github.com/mendableai/firecrawl. Useful if data is sensitive or budget is limited.

### Option 3: Fallback without Firecrawl

If Firecrawl is missing, degrade gracefully:

| Need | Fallback | Limitation |
|--------|----------|------------|
| Simple static page | `curl -sL URL \| pandoc -f html -t markdown` | No JS rendering |
| JS-heavy page | `npx playwright` + `p