small-sample-analysis

End-to-end methodology for supervised machine learning on small datasets (typically 30-200 samples) where standard "throw XGBoost at it" approaches fail. Use this skill whenever the user is building a predictive model on a small dataset, especially when sample-to-feature ratios are tight, when interpretability matters as much as accuracy, when the user needs to justify model choices to non-technical stakeholders, or when they need a rigorous "diagnose-improve-verify" workflow rather than just a final model. Trigger this even if the user only asks for a specific piece (e.g. "help me pick features", "validate this model"), since small-sample problems require the full methodology to avoid silent overfitting. Also trigger for store-selection / site-selection problems, B2B sales analytics, biomedical studies, A/B test analysis with limited cohorts, and any "we only have N stores/patients/experiments and need to predict Y" scenario.
jiachengwang-punch/small-sample-analysis · ★ 2 · AI & Automation · score 71

Install: claude install-skill jiachengwang-punch/small-sample-analysis

# Small Sample Analysis

A complete methodology for building defensible predictive models on small datasets (typically n < 200, often n < 50).

## When this skill applies

Small-sample analysis differs fundamentally from standard ML workflows. The defaults that work on 100k+ rows actively harm models on small data:

- **XGBoost/LightGBM**: overfit catastrophically; CV-R² often negative
- **Single train/test split**: variance too high to draw conclusions
- **Stepwise feature selection**: picks noise as signal
- **Headline metric reporting (just R²)**: hides systematic bias

This skill captures a methodology that handles these pitfalls explicitly.

**Triggers:**

- Sample size mentioned as small (< 200, especially < 50)
- Feature-to-sample ratio is concerning (p/n > 0.1)
- User asks "why not XGBoost" or shows confusion about model choice
- User needs to justify decisions to non-technical stakeholders
- User uses words like "stores", "patients", "experiments", "cohorts" with limited counts
- Any predictive modeling task where the user needs interpretability + rigor

## Output language

Match the user's natural language for all deliverables (Notebook markdown, Word body, chart labels, slides). Code, math notation, and standard ML abbreviations (Ridge, SHAP, R², MAPE) stay in English regardless. For non-Latin scripts, set CJK-capable fonts in matplotlib (`Noto Sans CJK JP`) and docx (`Microsoft YaHei`) to prevent `□□□` rendering bugs.

## Core principles

Drill these into every