when-debugging-ml-training-use-ml-training-debugger

Solid

Debug ML training issues, including loss divergence, overfitting, and slow convergence, and optimize training performance.

AI & Automation · 335 stars · 29 forks · Updated today

Quality Score: 85/100

Stars (20%): 84
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 80
License (10%): 0
Description (5%): 100

Skill Content

# ML Training Debugger - Diagnose and Fix Training Issues

## Overview

Systematic debugging workflow for ML training issues including loss divergence, overfitting, slow convergence, gradient problems, and performance optimization.

## When to Use

- Training loss becomes NaN or infinite
- Severe overfitting (train >> val performance)
- Training not converging
- Gradient vanishing/exploding
- Poor validation accuracy
- Training too slow

## Phase 1: Diagnose Issue (8 min)

### Objective

Identify the specific training problem

### Agent: ML-Developer

**Step 1.1: Analyze Training Curves**

```python
import json
import numpy as np

# Load training history
with open('training_history.json', 'r') as f:
    history = json.load(f)

# Diagnose issues
# NOTE: the check_* helpers below must be defined before this dict is evaluated.
diagnosis = {
    'loss_divergence': check_loss_divergence(history['loss']),
    'overfitting': check_overfitting(history['loss'], history['val_loss']),
    'slow_convergence': check_convergence_rate(history['loss']),
    'gradient_issues': check_gradient_health(history),
    'nan_values': any(np.isnan(history['loss']))
}

def check_loss_divergence(losses):
    # Loss increasing over time: the last 5 epochs average higher than the 5 before
    if len(losses) > 10:
        recent_trend = np.mean(losses[-5:]) > np.mean(losses[-10:-5])
        return recent_trend
    return False  # Not enough epochs to judge

def check_overfitting(train_loss, val_loss):
    # Val loss diverging from train loss
    if len(train_loss) > 10:
        gap = np.mean(val_loss[-5:]) - np.mean(train_loss[-5:])
        return gap > 0.5  # Significant gap
    return False  # Not enough epochs to judge

def che...
```
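The excerpt above is cut off before the remaining helpers, so the following is only a minimal, self-contained sketch of the checks that are visible in it (recent-window divergence, train/validation gap, NaN detection), run on a synthetic history. The function name `diagnose_history`, its `window` and `gap_threshold` parameters, and the synthetic curves are assumptions made for illustration. They are not part of the skill, and the sketch does not attempt to reproduce the truncated `check_convergence_rate` or `check_gradient_health` helpers.

```python
import numpy as np


def diagnose_history(history, window=5, gap_threshold=0.5):
    """Hypothetical helper: apply the excerpt's visible checks to a history dict."""
    loss = np.asarray(history["loss"], dtype=float)
    val_loss = np.asarray(history["val_loss"], dtype=float)

    # Mirror the excerpt's `len(...) > 10` guard for a window of 5.
    enough = len(loss) > 2 * window

    # Divergence: the last window averages higher than the window before it.
    diverging = bool(
        enough and loss[-window:].mean() > loss[-2 * window:-window].mean()
    )

    # Overfitting: validation loss ends well above training loss.
    overfitting = bool(
        enough and val_loss[-window:].mean() - loss[-window:].mean() > gap_threshold
    )

    return {
        "loss_divergence": diverging,
        "overfitting": overfitting,
        "nan_values": bool(np.isnan(loss).any()),
    }


# Synthetic history (assumed data): training loss decays steadily,
# while validation loss drifts upward over the epochs.
epochs = np.arange(20)
history = {
    "loss": list(2.0 * np.exp(-0.3 * epochs)),
    "val_loss": list(2.0 * np.exp(-0.3 * epochs) + 0.05 * epochs),
}

report = diagnose_history(history)
print(report)
```

On this synthetic history only the overfitting flag is raised, because the training loss keeps falling while the validation gap grows past the 0.5 threshold used in the excerpt. Defining the helper before it is called also avoids the ordering problem of evaluating the diagnosis before its check functions exist.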

Details

Author: aiskillstore
Repository: aiskillstore/marketplace
Created: 5 months ago
Last Updated: today
Language: Python
License: None
