site-content-catalog

Crawl a website's sitemap and blog index to build a complete content inventory. Lists every page with URL, title, publish date, content type, and topic cluster. Groups content by category and topic. Optionally deep-reads top N pages for quality analysis and funnel stage tagging. Use before SEO audits, content gap analysis, or brand voice extraction.
gooseworks-ai/goose-skills · ★ 727 · Web & Frontend · score 82

Install: claude install-skill gooseworks-ai/goose-skills

# Site Content Catalog

Crawl a website's sitemap and blog to build a complete content inventory: every page cataloged with URL, title, date, content type, and topic cluster. Groups content by category, identifies publishing patterns, and optionally deep-analyzes top pages.

## Quick Start

```bash
# Basic content inventory
python3 scripts/catalog_content.py --domain "example.com"

# With deep analysis of top 20 pages
python3 scripts/catalog_content.py --domain "example.com" --deep-analyze 20

# Output to specific file
python3 scripts/catalog_content.py --domain "example.com" --output content-inventory.json
```

## Inputs

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| domain | Yes | n/a | Domain to catalog (e.g., "example.com") |
| deep-analyze | No | 0 | Number of top pages to deep-read for content analysis |
| output | No | stdout | Path to save JSON output |
| include-non-blog | No | true | Also catalog landing pages, docs, etc. (not just blog) |

## Cost

- **Sitemap/RSS crawling:** Free (direct HTTP requests)
- **Apify sitemap extractor (fallback):** ~$0.50 per site
- **Deep analysis:** Free (WebFetch on individual pages)

## Process

### Phase 1: Discover All Pages

The script attempts multiple methods to find all pages on a site, in order:

#### A) Sitemap.xml

1. Fetch `https://[domain]/sitemap.xml`
2. If it's a sitemap index, recursively fetch all child sitemaps
3. Common alternate locations: `/sitemap_index.xml`, `