bigquery-pipeline-audit

Solid

Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness. Returns a structured report with exact patch locations.

Data & Documents · 34,887 stars · 4,287 forks · Updated today · MIT

Quality Score: 93/100

Stars 20%: 100
Recency 20%: 100
Frontmatter 20%
Documentation 15%: 100
Issue Health 10%
License 10%: 100
Description 5%: 100

Skill Content

# BigQuery Pipeline Audit: Cost, Safety and Production Readiness

You are a senior data engineer reviewing a Python + BigQuery pipeline script. Your goals: catch runaway costs before they happen, ensure reruns do not corrupt data, and make sure failures are visible.

Analyze the codebase and respond in the structure below (A to F + Final). Reference exact function names and line locations. Suggest minimal fixes, not rewrites.

---

## A) COST EXPOSURE: What will actually get billed?

Locate every BigQuery job trigger (`client.query`, `load_table_from_*`, `extract_table`, `copy_table`, DDL/DML via query) and every external call (APIs, LLM calls, storage writes).

For each, answer:

- Is this inside a loop, retry block, or async gather?
- What is the realistic worst-case call count?
- For each `client.query`, is `QueryJobConfig.maximum_bytes_billed` set? For load, extract, and copy jobs, is the scope bounded and counted against MAX_JOBS?
- Is the same SQL and params being executed more than once in a single run? Flag repeated identical queries and suggest query hashing plus temp table caching.

**Flag immediately if:**

- Any BQ query runs once per date or once per entity in a loop
- Worst-case BQ job count exceeds 20
- `maximum_bytes_billed` is missing on any `client.query` call

---

## B) DRY RUN AND EXECUTION MODES

Verify a `--mode` flag exists with at least `dry_run` and `execute` options.

- `dry_run` must print the plan and estimated scope with zero billed BQ executio...
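As an illustration of two checks from section A (repeated identical queries and the worst-case job count), here is a minimal sketch using only the standard library. The names `query_fingerprint` and `JobBudget` are hypothetical and do not come from the skill itself. The default of 20 mirrors the "exceeds 20" flag, on the assumption that `MAX_JOBS` refers to the same limit. A real pipeline would call `register` before each `client.query` and read a cached temp table when it returns `False`.

```python
import hashlib
import json

MAX_JOBS = 20  # assumption: same threshold as the "exceeds 20" flag in section A


def query_fingerprint(sql: str, params: dict) -> str:
    """Hash whitespace-normalized SQL plus sorted params into a stable key.

    Caveat: collapsing whitespace also affects string literals, so this is
    only a heuristic for spotting repeated queries.
    """
    normalized_sql = " ".join(sql.split())
    payload = json.dumps(
        {"sql": normalized_sql, "params": params}, sort_keys=True, default=str
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JobBudget:
    """Hypothetical guard: counts planned jobs and flags repeated queries."""

    def __init__(self, max_jobs: int = MAX_JOBS):
        self.max_jobs = max_jobs
        self.job_count = 0
        self.seen = {}  # fingerprint -> number of times requested

    def register(self, sql: str, params: dict) -> bool:
        """Return True if the query should run, False if it is a repeat."""
        key = query_fingerprint(sql, params)
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] > 1:
            return False  # identical SQL and params were already requested
        if self.job_count >= self.max_jobs:
            raise RuntimeError(
                f"job budget exceeded: more than {self.max_jobs} jobs planned"
            )
        self.job_count += 1
        return True

    def duplicates(self) -> dict:
        """Fingerprints that were requested more than once in this run."""
        return {key: n for key, n in self.seen.items() if n > 1}
```

Raising when the budget is exhausted, rather than logging and continuing, makes a per-date or per-entity loop fail loudly before further jobs run.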

Details

Author: github
Repository: github/awesome-copilot
Created: 1 year ago
Last Updated: today
Language: Python
License: MIT

Integrates with

Google Cloud · Cloud

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

bigquery-cost-audit

Use when reviewing BigQuery spend, query failure patterns, or scan inefficiencies -- identifying which jobs, users, or projects drive cost, or preparing optimization recommendations for a cost review.

7 Updated yesterday

yeaight7

Data & Documents Listed

verify-pipeline

Run a full health check across the MDS pipeline: ingestion (dlt/Airbyte) load status, BigQuery freshness per source, ingest reconciliation (source-vs-destination row counts), dbt model freshness, MCP server health, and raw-vs-staging row count integrity. Invoke when the user wants to confirm the pipeline is healthy or asks 'is everything working?'

1 Updated 1 week ago

pol-cc

Data & Documents Listed

data-pipeline

Wire ETL, ingestion, cron, edge-function, and queue jobs correctly. Use for "build a pipeline", "sync X into Y", "nightly aggregation", "cron double-counts", "dedupe", "backfill", "the numbers are wrong after a retry". Bakes in idempotency, atomic writes, data contracts, dead-letter, and observability.

4 Updated today

kensaurus