ingesting-data

Data ingestion patterns for loading data from cloud storage, APIs, files, and streaming sources into databases. Use when importing CSV/JSON/Parquet files, pulling from S3/GCS buckets, consuming API feeds, or building ETL pipelines.
ancoleman/ai-design-components · ★ 372 · Data & Documents · score 75

Install: claude install-skill ancoleman/ai-design-components

# Data Ingestion Patterns

This skill provides patterns for getting data INTO systems from external sources.

## When to Use This Skill

- Importing CSV, JSON, Parquet, or Excel files
- Loading data from S3, GCS, or Azure Blob storage
- Consuming REST/GraphQL API feeds
- Building ETL/ELT pipelines
- Database migration and CDC (Change Data Capture)
- Streaming data ingestion from Kafka/Kinesis

## Ingestion Pattern Decision Tree

```
What is your data source?
├── Cloud Storage (S3, GCS, Azure) → See cloud-storage.md
├── Files (CSV, JSON, Parquet) → See file-formats.md
├── REST/GraphQL APIs → See api-feeds.md
├── Streaming (Kafka, Kinesis) → See streaming-sources.md
├── Legacy Database → See database-migration.md
└── Need full ETL framework → See etl-tools.md
```

## Quick Start by Language

### Python (Recommended for ETL)

**dlt (data load tool) - Modern Python ETL:**

```python
import dlt
import requests  # HTTP client used by the resource below


# Define a source
@dlt.source
def github_source(repo: str):
    # "merge" upserts rows on the primary key instead of appending duplicates
    @dlt.resource(write_disposition="merge", primary_key="id")
    def issues():
        response = requests.get(f"https://api.github.com/repos/{repo}/issues")
        response.raise_for_status()  # fail fast on HTTP errors (e.g. rate limiting)
        yield response.json()  # first page of issues, as a list of dicts

    return issues


# Load to destination
pipeline = dlt.pipeline(
    pipeline_name="github_issues",
    destination="postgres",  # or duckdb, bigquery, snowflake
    dataset_name="github_data",
)
load_info = pipeline.run(github_source("owner/repo"))
print(load_info)
```

**Polars for file processing (often faster than pandas):**

```python
import pola