data-engineering

Featured

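Data engineering: Airflow, Dagster, Kafka Streams, Flink, dbt, data pipelines, stream processing, and data quality. Route here when the user mentions data pipelines, ETL, stream processing, or data quality.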

Data & Documents 5,403 stars 413 forks Updated 2 days ago MIT


Quality Score: 99/100

Stars (weight 20%): 100
Recency (weight 20%): 100
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# Data Engineering Domain

## Domain Overview

The data engineering domain covers three core areas: pipeline orchestration, stream processing, and data quality assurance.

```
Pipeline layer                   Stream processing layer   Quality assurance layer
├── Airflow (scheduling)         ├── Kafka Streams         ├── Great Expectations
├── Dagster (asset management)   ├── Flink                 ├── dbt
└── Prefect (modern workflows)   └── Spark Streaming       └── Soda Core
```

---

## Pipeline Orchestration

### Framework Comparison

| Feature | Airflow | Dagster | Prefect |
|------|---------|---------|---------|
| Core model | DAG + Task | Asset + Op | Flow + Task |
| Learning curve | Steep | Moderate | Gentle |
| Asset management | None | Native support | None |
| Dynamic tasks | Supported | Supported | Supported |
| Local development | Complex | Simple | Simple |
| Community ecosystem | Largest | Growing | Growing |

### Airflow Core Patterns

- DAG definition: `with DAG(dag_id, schedule, default_args) as dag`
- TaskFlow API: the `@task` decorator, with automatic XCom passing
- Dynamic tasks: `@task` + `.expand()` for dynamic task mapping
- Operators: PythonOperator / BashOperator / SQL / HTTP / S3
- Sensors: FileSensor / HttpSensor / ExternalTaskSensor
- Retry policy: `retries=3, retry_delay=timedelta(minutes=5), retry_exponential_backoff=True`
- Failure callback: `on_failure_callback` sends alerts
- SLA monitoring: `sla=timedelta(hours=2)` + `sla_miss_callback`

### Dagster Core Patterns

- Asset definition: `@asset(group_name, deps)` declares a data asset
- MaterializeResult: returns metadata (row count, preview, etc.)
- Resources: `ConfigurableResource` manages external connections
- Jobs: `define_asset_job(selection=AssetSelection.groups(...))`
- Schedules: `ScheduleDefinition(job, cron_schedule)`
- Sensors: `@sensor(job)` listens for external events and triggers runs
- Partitions: `DailyPartitionsDefinition` partitions by day
- Asset Checks: `@asset_check` validates data freshness and quality

### Prefect Core Patterns

- Flow/Task: `@flow`...
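
The Airflow retry policy listed above (`retries=3, retry_delay=timedelta(minutes=5), retry_exponential_backoff=True`) can be illustrated with a small standard-library sketch. The helper name `backoff_schedule` and its `max_delay` parameter are hypothetical and exist only for this illustration; the sketch assumes a plain doubling of the base delay, whereas Airflow's own scheduling also applies jitter and an optional `max_retry_delay` cap, so real wait times will differ.

```python
from datetime import timedelta


def backoff_schedule(retries, retry_delay, max_delay=None):
    """Return the wait before each retry, doubling the base delay each time.

    Hypothetical helper for illustration only; not part of Airflow.
    """
    delays = []
    for attempt in range(retries):
        delay = retry_delay * (2 ** attempt)  # 1x, 2x, 4x, ... the base delay
        if max_delay is not None and delay > max_delay:
            delay = max_delay  # optional upper bound on any single wait
        delays.append(delay)
    return delays


# Same values as the retry policy bullet: 3 retries, 5-minute base delay.
schedule = backoff_schedule(retries=3, retry_delay=timedelta(minutes=5))
for attempt, delay in enumerate(schedule, start=1):
    print(f"retry {attempt}: wait {delay}")
```

Under this simplified model the three retries wait roughly 5, 10, and 20 minutes, which is worth keeping in mind when choosing an SLA window such as the two hours mentioned above.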

Details

Author
fengshao1227
Repository
fengshao1227/ccg-workflow
Created
4 months ago
Last Updated
2 days ago
Language
Go
License
MIT

Similar Skills

Semantically similar based on skill content — not just same category

Data & Documents Solid

data-engineering

Data engineering (Airflow/Dagster/Kafka/Flink/dbt, data pipelines, ETL, stream processing, data quality).

13 Updated 5 days ago
wzyxdwll
Data & Documents Listed

data-pipeline

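[Data Pipeline] ETL pipeline design, Airflow/dbt patterns, data validation, monitoring and alerting. When to trigger: - the user asks to "design a data pipeline" or for an "ETL workflow" - an Airflow DAG needs to be set up - data transformation and validation. Provides a complete data pipeline design.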

0 Updated 2 days ago
afine907
AI & Automation Solid

data-engineering-master

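Data engineering: a cognitive operating system for data platform practitioners, covering the full lifecycle of moving data from source systems into a reliable / queryable / trustworthy form for consumption by analytics / ML / data products (generation → ingestion → storage → transformation → serving, plus six undercurrents: security / data management / DataOps / data architecture / orchestration / software engineering; the Reis & Housley framework): ingestion and integration (batch + CDC change data capture with Debezium + EL tools Fivetran/Airbyte/Meltano/dlt + Kafka Connect + schema drift) / storage and file and table formats (object-storage data lakes + columnar formats Parquet/ORC/Arrow/Avro + open table formats Apache Iceberg/Delta Lake/Apache Hudi + lakehouse + partitioning/compaction) / transformation and modeling (ELT with dbt/SQLMesh + Spark + dimensional modeling Kimball + Inmon + Data Vault + One Big Table (OBT) + slowly changing dimensions (SCD) + incremental models + semantic/metrics layer) / orchestration and workflows (Apache Airflow/Dagster/Prefect/Mage/Kestra/Apache DolphinScheduler + DAGs + idempotency + backfill + data asset scheduling) / batch, streaming, and real time (Apache Kafka/Apache Flink/Spark Structured Streaming/Kinesis/Pulsar/Redpanda + Lambda vs Kappa + watermarks/windows/exactly-once + streaming SQL Materialize/RisingWave + real-time OLAP ClickHouse/Apache Druid/Apache Pinot/StarRocks/Apache Doris) / data warehouses and query engines (Snowflake/BigQuery/Redshift/Databricks SQL/Trino/Presto/DuckDB/Polars + separation of storage and compute + MPP) / data quality testing and observability (dbt tests/Great Expectations/Soda + data contracts + Monte Carlo data downtime + freshness/volume/schem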

38 Updated 4 days ago
swaylq
Data & Documents Listed

orchestration-patterns

Airflow/Prefect/Dagster DAG design — task dependencies, retries, SLAs, backfill strategies, sensors, and failure recovery. Use this skill whenever the user is building or debugging a scheduled pipeline with multiple steps, asking how to handle task failures, setting up retries or alerts, designing a DAG structure, choosing between orchestrators, or dealing with backfill/reprocessing of historical data. Also trigger when the user mentions Airflow operators, Prefect flows, Dagster assets, task queues, or pipeline scheduling — even if they don't say "orchestration" explicitly. If a pipeline has more than two steps and needs to run on a schedule, this skill should be active.

0 Updated 5 days ago
Methasit-Pun
Data & Documents Featured

data-engineering-data-pipeline

You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.

39,350 Updated today
sickn33