training-check

Solid

Periodically check WandB metrics during training to catch problems early (NaN, loss divergence, idle GPUs). Avoids wasting GPU hours on broken runs. Use when training is running and you want automated health checks.

AI & Automation · 11,977 stars · 1,099 forks · Updated yesterday · MIT

Quality Score: 96/100

Stars (weight 20%): 100
Recency (weight 20%): 100
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# Training Check

Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.

## Context: $ARGUMENTS

## Constants

- WANDB_ENTITY and WANDB_PROJECT: read from CLAUDE.md or passed as argument (format: `entity/project/run_id`)
- CHECK_INTERVAL: starts at 10 minutes, then gradually increases if consistently healthy: 10 min → 20 min → 30 min → 60 min (cap)
- REVIEWER_MODEL = `gpt-5.4` (used via Codex MCP for ambiguous cases only)

## When to Use

- After training is confirmed running (session alive, loss decreasing for first few steps)
- Set up via CronCreate to fire periodically during training
- **This skill checks training QUALITY, not process HEALTH.** Process health (session alive, GPU utilization) is [watchdog.py](../../tools/watchdog.py)'s job.

## Workflow

### Step 1: Read WandB Metrics

```python
import wandb

api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
history = run.history()
```

If WandB is unreachable (API error, network issue), fall back to reading the log file directly via SSH:

```bash
ssh server "tail -100 /path/to/training.log"
```

Check these signals:

- **Loss trend**: Is training loss decreasing over the last N steps?
- **Eval metrics**: Are evaluation metrics improving (or at least not degrading)?
- **NaN / Inf**: Any NaN or Inf values in loss or gradients?
- **Spikes**: Sudden large jumps in loss (>10x normal variance)?
- **Learning rate**: Is the sc...
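
The loss-related signals in Step 1 (NaN / Inf, spikes, loss trend) and the CHECK_INTERVAL schedule from Constants could be sketched in plain Python roughly as follows. This is a minimal illustration, not the skill's actual implementation: the helper names `check_loss_signals` and `next_check_interval`, the `window` size, the median-based spike rule (a stand-in for ">10x normal variance"), and the reset-to-10-minutes behavior after an unhealthy check are all assumptions.

```python
import math
from statistics import mean, median

# Backoff schedule in minutes, mirroring CHECK_INTERVAL: 10 -> 20 -> 30 -> 60 (cap).
SCHEDULE_MIN = (10, 20, 30, 60)


def check_loss_signals(losses, window=20, spike_factor=10.0):
    """Return a list of problem tags found in a sequence of loss values.

    Hypothetical helper: `window` and the spike rule are illustrative choices.
    """
    problems = []

    # NaN / Inf: report any non-finite value, then exclude it from later checks.
    finite = [x for x in losses if math.isfinite(x)]
    if len(finite) != len(losses):
        problems.append("nan_or_inf")

    # Spikes: a step-to-step jump larger than spike_factor times the median jump.
    deltas = [abs(b - a) for a, b in zip(finite, finite[1:])]
    if deltas:
        typical = median(deltas)
        if typical > 0 and max(deltas) > spike_factor * typical:
            problems.append("spike")

    # Loss trend: the mean of the latest window should be below the previous one.
    if len(finite) >= 2 * window:
        recent = mean(finite[-window:])
        previous = mean(finite[-2 * window:-window])
        if recent >= previous:
            problems.append("not_decreasing")

    return problems


def next_check_interval(current_min, healthy):
    """Advance one step along the schedule when healthy; otherwise reset (assumed)."""
    if not healthy:
        return SCHEDULE_MIN[0]
    for step in SCHEDULE_MIN:
        if step > current_min:
            return step
    return SCHEDULE_MIN[-1]
```

A caller would pass in a list of loss values pulled from the run history (the column name depends on what the training script logs) and treat an empty result as a healthy check when choosing the next interval.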

Details

Author: wanshuiyin
Repository: wanshuiyin/Auto-claude-code-research-in-sleep
Created: 3 months ago
Last Updated: yesterday
Language: Python
License: MIT
