training-check

Quality Score: 96/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Training Check

Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.

## Context: $ARGUMENTS

## Constants

- WANDB_ENTITY and WANDB_PROJECT: read from CLAUDE.md or passed as argument (format: `entity/project/run_id`)
- CHECK_INTERVAL: starts at 10 minutes, then gradually increases if consistently healthy: 10 min → 20 min → 30 min → 60 min (cap)
- REVIEWER_MODEL = `gpt-5.4` (used via Codex MCP for ambiguous cases only)

## When to Use

- After training is confirmed running (session alive, loss decreasing for first few steps)
- Set up via CronCreate to fire periodically during training
- **This skill checks training QUALITY, not process HEALTH.** Process health (session alive, GPU utilization) is [watchdog.py](../../tools/watchdog.py)'s job.

## Workflow

### Step 1: Read WandB Metrics

```python
import wandb
api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
history = run.history()
```

If WandB is unreachable (API error, network issue), fall back to reading the log file directly via SSH:

```bash
ssh server "tail -100 /path/to/training.log"
```

Check these signals:

- **Loss trend**: Is training loss decreasing over the last N steps?
- **Eval metrics**: Are evaluation metrics improving (or at least not degrading)?
- **NaN / Inf**: Any NaN or Inf values in loss or gradients?
- **Spikes**: Sudden large jumps in loss (>10x normal variance)?
- **Learning rate**: Is the sc...
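
The loss-related signals from Step 1 (NaN / Inf, spikes, loss trend) and the CHECK_INTERVAL schedule can be sketched in plain Python as below. This is an illustrative sketch, not the skill's actual implementation: the function names `check_loss` and `next_interval`, the `window` size, the use of the median step-to-step change as the "normal variance" baseline, and the reset-to-10-minutes behaviour after an unhealthy check are all assumptions introduced here.

```python
import math
from statistics import mean, median

# CHECK_INTERVAL schedule from the Constants section, in minutes.
INTERVALS_MIN = [10, 20, 30, 60]


def next_interval(current_min, healthy):
    """Move one step up the schedule after a healthy check.

    Assumption: an unhealthy check resets the interval to 10 minutes,
    and a single healthy check is enough to step up.
    """
    if not healthy:
        return INTERVALS_MIN[0]
    for candidate in INTERVALS_MIN:
        if candidate > current_min:
            return candidate
    return INTERVALS_MIN[-1]  # already at the 60-minute cap


def check_loss(losses, window=20, spike_factor=10.0):
    """Return simple flags for a list of per-step training losses."""
    has_nan_or_inf = any(not math.isfinite(x) for x in losses)
    finite = [x for x in losses if math.isfinite(x)]

    # Trend: compare the mean of the last `window` steps with the mean of
    # the `window` steps before them. None means "not enough data yet".
    decreasing = None
    if len(finite) >= 2 * window:
        recent = mean(finite[-window:])
        previous = mean(finite[-2 * window:-window])
        decreasing = recent < previous

    # Spike: an upward step-to-step jump larger than `spike_factor` times
    # the median absolute step-to-step change (one reading of
    # ">10x normal variance"; the median is not skewed by the spike itself).
    diffs = [b - a for a, b in zip(finite, finite[1:])]
    spike = False
    if diffs:
        typical = median([abs(d) for d in diffs])
        spike = typical > 0 and max(diffs) > spike_factor * typical

    return {
        "nan_or_inf": has_nan_or_inf,
        "loss_decreasing": decreasing,
        "spike": spike,
    }


if __name__ == "__main__":
    smooth = [2.0 - 0.01 * i for i in range(40)]
    print(check_loss(smooth))
    print(check_loss(smooth + [50.0]))
    print(next_interval(10, healthy=True))
```

In practice `losses` would come from the `history` object returned by `run.history()`; the metric key to read (for example a training-loss column) depends on what the training script logs, so it is left out of the sketch.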

Details

Author: wanshuiyin
Repository: wanshuiyin/Auto-claude-code-research-in-sleep
Created: 3 months ago
Last Updated: yesterday
Language: Python
License: MIT

Integrates with

Similar Skills

weights-and-biases
