ml-feature-engineering

Feature store patterns, training/serving skew prevention, feature pipelines for ML teams, point-in-time correct joins, and bridging data engineering with MLOps conventions. Use this skill whenever an ML team needs feature pipelines, when building a feature store or deciding whether to use one, when there's a training/serving skew problem (model performance in production differs from validation), when features need to be shared across multiple models, or when designing point-in-time correct feature computation. Also trigger when the user mentions feature stores (Feast, Tecton, Hopsworks), label leakage, backfilling features, offline/online store separation, or when data engineering work feeds directly into model training. If the words "features", "training set", "model pipeline", or "MLOps" appear alongside data engineering questions, this skill should be active.
Methasit-Pun/data_engineer_claude_skills · ★ 1 · Data & Documents · score 62

Install: claude install-skill Methasit-Pun/data_engineer_claude_skills

# ML Feature Engineering Patterns

## The Data Engineering / ML Boundary

Data engineers own the data. ML engineers own the models. Feature engineering sits at the boundary, and when it's designed poorly, both sides pay for it.

The most common failures:

1. **Training/serving skew**: features computed differently at training time vs. prediction time → model performs worse in production than in validation
2. **Label leakage**: features computed using data from the future, making the training set artificially easy → model fails in production
3. **Feature duplication**: each ML team recomputes the same features independently → inconsistent definitions, wasted compute

The patterns in this skill address all three.

---

## Point-in-Time Correct Joins

This is the most important concept in ML feature engineering. A model trained to predict churn should only "see" data that would have been available at the time of the prediction, not data from the future.

### The wrong way

```sql
-- BAD: This leaks future data into training
-- If we're predicting churn as of 2024-01-15, we shouldn't know about
-- events that happened on 2024-01-20
SELECT
    u.user_id,
    u.subscription_tier,
    COUNT(e.event_id) AS events_last_30d,  -- counts events AFTER the label date!
    u.churned AS label
FROM users u
JOIN events e ON u.user_id = e.user_id
    AND e.event_date >= DATEADD(day, -30, CURRENT_DATE)  -- wrong: uses today's date
WHERE u.label_date = '2024-01-15'
GROUP BY 1, 2, 4;
```

###