Install: `claude install-skill kensaurus/cursor-kenji`
# Data Pipeline Correctness
> Pipelines fail silently: a retry double-counts, a partial write corrupts a table, a schema drift poisons a dashboard, and nobody notices until the numbers are wrong. This skill bakes correctness in at build time. It complements `sbc-qa-data-integrity-audit` (which *detects* these after the fact) and the Supabase plugin (DB/Edge Functions/RLS).
## When this fires
Any job that **moves, transforms, or aggregates** data: ingestion/ETL/ELT, scheduled aggregations, edge-function workers, `pg_cron` jobs, queue consumers, webhook processors, backfills, materialized-view refreshes.
## Non-negotiables (the 5 that prevent silent corruption)
1. **Idempotency** — running the same job twice must not change the result. Retries, at-least-once queues, and overlapping cron fires are guaranteed, not hypothetical.
   - Use `INSERT ... ON CONFLICT (natural_key) DO UPDATE` (upsert), not blind `INSERT`.
   - Derive a deterministic dedup key from the source event, not `now()` or a random id.
   - For aggregates: recompute-and-replace a window, or use idempotent deltas; never `count = count + 1` on a path that can retry.
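
The three idempotency rules above can be sketched together as follows. This is a minimal illustration, not a prescribed implementation: it uses Python's built-in `sqlite3` as a stand-in for Postgres (both accept `INSERT ... ON CONFLICT ... DO UPDATE`), and the `orders` and `daily_totals` tables, the event fields, and the `dedup_key` and `ingest` helpers are all hypothetical names chosen for the example.

```python
import hashlib
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create the hypothetical example schema in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "event_key TEXT PRIMARY KEY, day TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE daily_totals ("
        "day TEXT PRIMARY KEY, order_count INTEGER NOT NULL, total_cents INTEGER NOT NULL)"
    )
    return conn


def dedup_key(event: dict) -> str:
    """Deterministic key built only from source-event fields (no now(), no random id)."""
    canonical = json.dumps(
        {"source": event["source"], "order_id": event["order_id"]},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ingest(conn: sqlite3.Connection, events: list[dict]) -> None:
    """Upsert events, then recompute-and-replace the daily totals they touch."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.executemany(
            """
            INSERT INTO orders (event_key, day, amount_cents)
            VALUES (?, ?, ?)
            ON CONFLICT (event_key) DO UPDATE SET
                day = excluded.day,
                amount_cents = excluded.amount_cents
            """,
            [(dedup_key(e), e["day"], e["amount_cents"]) for e in events],
        )
        # Rebuild each affected day from the source rows instead of
        # incrementing a counter, so a retry cannot double-count.
        for day in sorted({e["day"] for e in events}):
            conn.execute("DELETE FROM daily_totals WHERE day = ?", (day,))
            conn.execute(
                "INSERT INTO daily_totals (day, order_count, total_cents) "
                "SELECT day, COUNT(*), SUM(amount_cents) "
                "FROM orders WHERE day = ? GROUP BY day",
                (day,),
            )
```

Calling `ingest` repeatedly with the same events, or with an overlapping subset, should leave both tables unchanged after the first successful run, which is the property a retrying queue consumer or an overlapping cron run depends on.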
2. **Atomicity** — a job either fully applies or not at all. No half-written batches.
   - Wrap multi-row writes in a transaction; stage to a temp/raw table then swap.
   - A function that writes to 3 tables must not leave 1 of them updated on failure.
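
As a rough sketch of the all-or-nothing rule, the example below writes one batch to three tables inside a single transaction. It again uses Python's `sqlite3` as a stand-in for Postgres; the `raw_events`, `facts`, and `job_runs` tables and the `apply_batch` helper are hypothetical names, and the bad row only simulates a mid-batch failure.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create three hypothetical tables that one job writes to."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE raw_events (id INTEGER PRIMARY KEY, payload TEXT NOT NULL);
        CREATE TABLE facts (id INTEGER PRIMARY KEY, value INTEGER NOT NULL);
        CREATE TABLE job_runs (batch_id TEXT PRIMARY KEY, row_count INTEGER NOT NULL);
        """
    )
    return conn


def apply_batch(conn: sqlite3.Connection, batch_id: str, rows: list[tuple]) -> None:
    """Write one batch to all three tables, or to none of them.

    Each row is (id, payload, value). Using the connection as a context
    manager commits if the block finishes and rolls back if it raises.
    """
    with conn:
        conn.executemany(
            "INSERT INTO raw_events (id, payload) VALUES (?, ?)",
            [(row_id, payload) for row_id, payload, _ in rows],
        )
        conn.executemany(
            "INSERT INTO facts (id, value) VALUES (?, ?)",
            [(row_id, value) for row_id, _, value in rows],
        )
        conn.execute(
            "INSERT INTO job_runs (batch_id, row_count) VALUES (?, ?)",
            (batch_id, len(rows)),
        )


def table_counts(conn: sqlite3.Connection) -> tuple[int, int, int]:
    """Return row counts for raw_events, facts, and job_runs."""
    return tuple(
        conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for table in ("raw_events", "facts", "job_runs")
    )
```

If a row violates a constraint partway through (for instance a `NULL` value in `facts`), `apply_batch` raises `sqlite3.IntegrityError` and the rows already written to `raw_events` are rolled back with it, so a later retry starts from a clean state.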
3. **Data contracts** — validate shape at the boundary before trusting input.
- Parse/validate (zod / pyd