site-content-cataloglisted
Install: claude install-skill gooseworks-ai/goose-skills
# Site Content Catalog
Crawl a website's sitemap and blog to build a complete content inventory — every page cataloged with URL, title, date, content type, and topic cluster. Groups content by category, identifies publishing patterns, and optionally deep-analyzes top pages.
## Quick Start
```bash
# Basic content inventory
python3 scripts/catalog_content.py --domain "example.com"
# With deep analysis of top 20 pages
python3 scripts/catalog_content.py --domain "example.com" --deep-analyze 20
# Output to specific file
python3 scripts/catalog_content.py --domain "example.com" --output content-inventory.json
```
## Inputs
| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| domain | Yes | — | Domain to catalog (e.g., "example.com") |
| deep-analyze | No | 0 | Number of top pages to deep-read for content analysis |
| output | No | stdout | Path to save JSON output |
| include-non-blog | No | true | Also catalog landing pages, docs, etc. (not just blog) |
## Cost
- **Sitemap/RSS crawling:** Free (direct HTTP requests)
- **Apify sitemap extractor (fallback):** ~$0.50 per site
- **Deep analysis:** Free (WebFetch on individual pages)
## Process
### Phase 1: Discover All Pages
The script attempts multiple methods to find all pages on a site, in order:
#### A) Sitemap.xml
1. Fetch `https://[domain]/sitemap.xml`
2. If it's a sitemap index, recursively fetch all child sitemaps
3. Common alternate locations: `/sitemap_index.xml`, `