firecrawl-data-handling

Process, validate, and store Firecrawl scraped content with deduplication and chunking. Use when handling scraped markdown, implementing content pipelines, building RAG knowledge bases, or processing crawl results for downstream consumption. Trigger with phrases like "firecrawl data", "firecrawl content processing", "firecrawl markdown cleaning", "firecrawl storage", "firecrawl RAG pipeline".

AI & Automation 2,359 stars 334 forks Updated today MIT

Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# Firecrawl Data Handling

## Overview

Process scraped web content from Firecrawl pipelines. Covers markdown cleaning, structured data extraction with Zod validation, content deduplication, chunking for LLM/RAG, and storage patterns for crawled content.

## Instructions

### Step 1: Content Cleaning

```typescript
import FirecrawlApp from "@mendable/firecrawl-js";

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY!,
});

// Scrape with clean output settings
async function scrapeClean(url: string) {
  const result = await firecrawl.scrapeUrl(url, {
    formats: ["markdown"],
    onlyMainContent: true, // strips nav, footer, sidebar
    excludeTags: ["script", "style", "nav", "footer", "iframe"],
    waitFor: 2000, // milliseconds to wait before capturing the page
  });

  return {
    url: result.metadata?.sourceURL || url,
    title: result.metadata?.title || "",
    markdown: cleanMarkdown(result.markdown || ""),
    scrapedAt: new Date().toISOString(),
  };
}

function cleanMarkdown(md: string): string {
  return md
    .replace(/\[.*?\]\(javascript:.*?\)/g, "") // remove JS links
    .replace(/!\[.*?\]\(data:.*?\)/g, "") // remove inline data URIs
    .replace(/<!--[\s\S]*?-->/g, "") // remove HTML comments
    .replace(/<script[\s\S]*?<\/script>/gi, "") // remove script tags
    .replace(/\n{3,}/g, "\n\n") // collapse multiple newlines (last, so removals leave no gaps)
    .trim();
}
```

### Step 2: Structured Extraction with Validation

```typescript
import { z } from "zod"...
```
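The overview lists content deduplication and chunking for LLM/RAG, but the excerpt ends before those steps appear. The following is a minimal sketch of one way they could look, not the skill's own implementation. The names `ScrapedPage`, `contentHash`, `dedupePages`, and `chunkMarkdown` are hypothetical, the record shape simply mirrors what `scrapeClean` returns, and the `maxChars` default of 1000 is an arbitrary assumption. It only catches exact duplicates after whitespace and case normalization; near-duplicate detection would need a different technique.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape, mirroring the object returned by scrapeClean.
interface ScrapedPage {
  url: string;
  title: string;
  markdown: string;
  scrapedAt: string;
}

// Hash whitespace-normalized, lowercased text so trivial formatting
// differences do not defeat exact-duplicate detection.
function contentHash(markdown: string): string {
  const normalized = markdown.replace(/\s+/g, " ").trim().toLowerCase();
  return createHash("sha256").update(normalized).digest("hex");
}

// Keep the first page seen for each distinct content hash.
function dedupePages(pages: ScrapedPage[]): ScrapedPage[] {
  const seen = new Set<string>();
  const unique: ScrapedPage[] = [];
  for (const page of pages) {
    const hash = contentHash(page.markdown);
    if (seen.has(hash)) continue;
    seen.add(hash);
    unique.push(page);
  }
  return unique;
}

// Split on blank lines and pack paragraphs into chunks of at most maxChars.
// A single paragraph longer than maxChars is hard-split at maxChars.
function chunkMarkdown(markdown: string, maxChars = 1000): string[] {
  const paragraphs = markdown
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const para of paragraphs) {
    if (para.length > maxChars) {
      // Flush what we have, then slice the oversized paragraph.
      if (current) {
        chunks.push(current);
        current = "";
      }
      for (let i = 0; i < para.length; i += maxChars) {
        chunks.push(para.slice(i, i + maxChars));
      }
      continue;
    }
    const candidate = current ? `${current}\n\n${para}` : para;
    if (candidate.length > maxChars) {
      chunks.push(current);
      current = para;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

In a pipeline, `dedupePages` would typically run over the cleaned pages before `chunkMarkdown`, so that duplicate pages are never chunked or embedded twice. Splitting on paragraph boundaries keeps most chunks semantically whole; a token-based limit may suit a specific embedding model better than a character count.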

Details

Author: jeremylongshore
Repository: jeremylongshore/claude-code-plugins-plus-skills
Created: 8 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

Anthropic · AI
