cloud-infra-data
Install: claude install-skill Methasit-Pun/data_engineer_claude_skills
# Cloud Infrastructure for Data Pipelines
## Service Selection Guide
### Compute (query engines)
| Service | Best fit | Cost model |
|---|---|---|
| **BigQuery** | Variable/spiky workloads, serverless preference | Per-TB scanned (on-demand) or slot reservations |
| **Snowflake** | Multi-cloud, strong SQL, virtual warehouse isolation | Per-credit (compute time) |
| **Redshift** | AWS-native, predictable workloads, RA3 storage separation | Per-node/hour or serverless per-RPU |
| **Databricks** | Spark workloads, ML/data science teams | DBU per hour |
| **Athena** | Ad-hoc queries on S3, minimal ops | Per-TB scanned |
The biggest practical difference: BigQuery and Athena are serverless, with no cluster to manage, while Snowflake and provisioned Redshift require you to think about concurrency and warehouse (or cluster) sizing.
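To make the two families of cost models in the table concrete, here is a minimal Python sketch comparing a per-TB-scanned bill with a time-based bill (credits, DBUs, or node-hours). All function names and every price passed in are hypothetical placeholders, not vendor list prices, and the sketch assumes a TB is counted as 2**40 bytes; check your provider's pricing page for the real unit and rate.

```python
# Rough cost comparison between scan-based and time-based pricing.
# Every price used with these helpers is a placeholder, not a real list price.

TIB = 2**40  # assumption: one "TB" is billed as 2**40 bytes


def scan_cost(bytes_scanned: int, price_per_tb: float) -> float:
    """Cost of a pay-per-scan query (on-demand BigQuery or Athena style)."""
    return bytes_scanned / TIB * price_per_tb


def compute_time_cost(hours: float, units_per_hour: float, price_per_unit: float) -> float:
    """Cost of time-based compute (credits, DBUs, or node-hours)."""
    return hours * units_per_hour * price_per_unit


def break_even_tb(hours: float, units_per_hour: float,
                  price_per_unit: float, price_per_tb: float) -> float:
    """TB scanned at which scan-based pricing costs as much as the time-based bill."""
    return compute_time_cost(hours, units_per_hour, price_per_unit) / price_per_tb


if __name__ == "__main__":
    # Placeholder numbers chosen only to exercise the functions.
    print(scan_cost(2 * TIB, price_per_tb=5.0))
    print(break_even_tb(hours=3, units_per_hour=4, price_per_unit=2.0, price_per_tb=5.0))
```

If your workload scans fewer TB than the break-even figure in the same window, scan-based pricing is likely cheaper under these assumptions; above it, reserved or time-based compute tends to win.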
### Storage
| Service | Use for |
|---|---|
| **S3 (AWS)** | Data lake, staging area, Parquet/Delta/Iceberg tables |
| **GCS (GCP)** | Same as S3 in the GCP ecosystem |
| **ADLS Gen2 (Azure)** | Azure data lake, hierarchical namespace for Hadoop compatibility |
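All three are object stores: they behave like key-value stores, not filesystems. On S3 and GCS the "folder" structure in the key name is just a naming convention; ADLS Gen2 with hierarchical namespace enabled is the exception, since it exposes real directories.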
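To see why "folders" in a flat object store are only a naming convention, the following sketch simulates a prefix-and-delimiter listing over a plain collection of keys. It is an in-memory illustration, not a call to any cloud SDK; `list_prefix` and the sample keys are hypothetical, and real listing APIs (for example S3's prefix/delimiter listing) add pagination and other details omitted here.

```python
# In-memory imitation of an object-store listing with a prefix and delimiter.
# There are no directories: "folders" are derived from the key strings alone.

def list_prefix(keys, prefix="", delimiter="/"):
    """Return (objects, common_prefixes) directly under `prefix`.

    objects: keys with no further delimiter after the prefix.
    common_prefixes: the apparent "subfolders", computed by string grouping.
    """
    objects, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows the delimiter, so this looks like a subfolder.
            common.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common)


if __name__ == "__main__":
    keys = [  # hypothetical keys for illustration
        "raw/source=salesforce/year=2024/month=01/day=15/events_20240115_001.parquet",
        "raw/readme.txt",
        "processed/domain=churn/part-0.parquet",
    ]
    print(list_prefix(keys))          # top level
    print(list_prefix(keys, "raw/"))  # "inside" raw/
```

Because the grouping is pure string matching, "renaming a folder" in such a store means copying and deleting every object under the prefix, which is worth keeping in mind when choosing a layout.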
---
## Storage Layout and Partitioning
### S3 / GCS bucket layout
```
s3://my-data-lake/
  raw/
    source=salesforce/
      year=2024/month=01/day=15/
        events_20240115_001.parquet
  processed/
    domain=churn/
      year=2024/month=01/
        churn_features_20240101.pa