when-debugging-ml-training-use-ml-training-debugger

Solid

Debug ML training issues such as loss divergence, overfitting, and slow convergence, and optimize training performance

AI & Automation 335 stars 29 forks Updated today

Quality Score: 85/100

Stars 20%

Recency 20%

100

Frontmatter 20%

Documentation 15%

100

Issue Health 10%

License 10%

Description 5%

100

Skill Content

# ML Training Debugger - Diagnose and Fix Training Issues

## Overview

Systematic debugging workflow for ML training issues including loss divergence, overfitting, slow convergence, gradient problems, and performance optimization.

## When to Use

- Training loss becomes NaN or infinite
- Severe overfitting (train >> val performance)
- Training not converging
- Gradient vanishing/exploding
- Poor validation accuracy
- Training too slow

## Phase 1: Diagnose Issue (8 min)

### Objective

Identify the specific training problem

### Agent: ML-Developer

**Step 1.1: Analyze Training Curves**

```python
import json
import numpy as np


def check_loss_divergence(losses):
    # Loss increasing over time: compare the last 5 epochs with the 5 before them
    if len(losses) > 10:
        recent_trend = np.mean(losses[-5:]) > np.mean(losses[-10:-5])
        return bool(recent_trend)
    return False  # not enough history to judge


def check_overfitting(train_loss, val_loss):
    # Val loss diverging from train loss
    if len(train_loss) > 10:
        gap = np.mean(val_loss[-5:]) - np.mean(train_loss[-5:])
        return bool(gap > 0.5)  # Significant gap
    return False  # not enough history to judge


# def che...
# check_convergence_rate and check_gradient_health are cut off in this listing.

# Load training history
with open('training_history.json', 'r') as f:
    history = json.load(f)

# Diagnose issues (run only after all helper functions are defined)
diagnosis = {
    'loss_divergence': check_loss_divergence(history['loss']),
    'overfitting': check_overfitting(history['loss'], history['val_loss']),
    'slow_convergence': check_convergence_rate(history['loss']),
    'gradient_issues': check_gradient_health(history),
    'nan_values': bool(np.isnan(history['loss']).any())
}
```
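The two curve checks visible in Step 1.1 can be tried without a real `training_history.json`. The sketch below is a minimal, self-contained version that runs them on synthetic loss curves. The function names, the `window` and `gap_threshold` parameters, and the synthetic data are assumptions made for illustration. `is_plateaued` is a hypothetical stand-in for a slow-convergence check, since the skill's own `check_convergence_rate` is not shown in the listing, and it should not be read as the author's implementation.

```python
import numpy as np


def loss_is_diverging(losses, window=5):
    """True when the mean of the last `window` losses exceeds the mean of the window before it."""
    if len(losses) <= 2 * window:
        return False  # too little history to compare two windows
    recent = np.mean(losses[-window:])
    previous = np.mean(losses[-2 * window:-window])
    return bool(recent > previous)


def is_overfitting(train_loss, val_loss, window=5, gap_threshold=0.5):
    """True when recent validation loss sits more than `gap_threshold` above recent training loss."""
    if min(len(train_loss), len(val_loss)) <= 2 * window:
        return False
    gap = np.mean(val_loss[-window:]) - np.mean(train_loss[-window:])
    return bool(gap > gap_threshold)


def is_plateaued(losses, window=5, min_rel_improvement=0.01):
    """Hypothetical slow-convergence check (an assumption, not the skill's code).

    Flags a run whose loss still improves, but by less than
    `min_rel_improvement` (relative) from one window to the next.
    """
    if len(losses) <= 2 * window:
        return False
    recent = np.mean(losses[-window:])
    previous = np.mean(losses[-2 * window:-window])
    if previous == 0:
        return False  # avoid dividing by zero on a perfectly fitted run
    improvement = (previous - recent) / abs(previous)
    return bool(0 <= improvement < min_rel_improvement)


# Synthetic 20-epoch histories (made up for this example).
healthy_train = [1.0 / (i + 1) for i in range(20)]
healthy_val = [loss + 0.05 for loss in healthy_train]
diverging = list(np.linspace(1.0, 0.5, 10)) + list(np.linspace(0.6, 3.0, 10))
overfit_train = list(np.linspace(1.0, 0.05, 20))
overfit_val = list(np.linspace(1.0, 1.5, 20))
plateau = [0.5] * 20

report = {
    "healthy run diverging": loss_is_diverging(healthy_train),
    "diverging run diverging": loss_is_diverging(diverging),
    "healthy run overfitting": is_overfitting(healthy_train, healthy_val),
    "overfit run overfitting": is_overfitting(overfit_train, overfit_val),
    "plateau run plateaued": is_plateaued(plateau),
}
for name, flagged in report.items():
    print(f"{name}: {flagged}")
```

The fixed 0.5 gap threshold from the original is an absolute value, so it only makes sense for losses on a comparable scale. For losses that are much larger or smaller, a relative gap may be a better fit.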

Details

Author: aiskillstore
Repository: aiskillstore/marketplace
Created: 5 months ago
Last Updated: today
Language: Python
License: None

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

ml-training-debugger

Diagnose machine learning training failures including loss divergence, mode collapse, gradient issues, architecture problems, and optimization failures. This skill spawns a specialist ML debugging ...

335 Updated today

aiskillstore

AI & Automation Solid

ml-model-training

Train ML models with scikit-learn, PyTorch, TensorFlow. Use for classification/regression, neural networks, hyperparameter tuning, or encountering overfitting, underfitting, convergence issues.

162 Updated 2 weeks ago

secondsky

AI & Automation Solid

when-developing-ml-models-use-ml-expert

Specialized ML model development, training, and deployment workflow

335 Updated today

aiskillstore

AI & Automation Listed

ml-antipattern-validator

Prevents 30+ critical AI/ML mistakes including data leakage, evaluation errors, training pitfalls, and deployment issues. Use when working with ML training, testing, model evaluation, or deployment.

335 Updated today

aiskillstore

AI & Automation Listed

paper-train

Use this skill when the user wants to configure training parameters, set hyperparameters, debug training issues, or analyze training results. Triggers include: "training config", "hyperparameters", "learning rate", "batch size", "training parameters", "training failed", "loss is NaN", "OOM error", "training debugging". Also use when evaluating trained models or generating result tables and figures.

2 Updated 4 days ago

charlotte-12s