distributed-training

Multi-GPU and distributed training patterns with PyTorch DDP. Use when scaling training across GPUs.
thada2402/AutoResearchClaw · ★ 1 · AI & Automation · score 73

Install: claude install-skill thada2402/AutoResearchClaw

## Distributed Training Best Practices

1. Use DistributedDataParallel (DDP) rather than DataParallel for multi-GPU training.
2. Initialize the process group before building the model: `dist.init_process_group(backend='nccl')`.
3. Use `DistributedSampler` to shard the data across processes, and call `sampler.set_epoch(epoch)` each epoch so the shuffle order changes.
4. Synchronize batch norm statistics across processes: `nn.SyncBatchNorm.convert_sync_batchnorm(model)`.
5. Save checkpoints on rank 0 only.
6. Scale the learning rate linearly with world size.
7. Use gradient accumulation to reach a larger effective batch size.
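
The practices above can be combined into one training script. The following is a minimal sketch, not a definitive implementation. It assumes a launch via `torchrun` (which sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`) and one GPU per process. The toy model, the random dataset, the base learning rate of 0.01, the `accum_steps` value, the `checkpoint.pt` path, and the helper names `scaled_lr` and `effective_batch_size` are all illustrative assumptions.

```python
import os


def scaled_lr(base_lr: float, world_size: int) -> float:
    """Linear scaling: single-process learning rate times the number of processes."""
    return base_lr * world_size


def effective_batch_size(per_gpu_batch: int, world_size: int, accum_steps: int) -> int:
    """Samples contributing to one optimizer step across all processes."""
    return per_gpu_batch * world_size * accum_steps


def main() -> None:
    # torch is imported here so the helpers above stay usable without it installed.
    import contextlib

    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    # 2. Initialize the process group; rank and world size come from the environment.
    dist.init_process_group(backend="nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    # 4. Convert BatchNorm layers to SyncBatchNorm, then 1. wrap the model in DDP.
    model = nn.Sequential(nn.Linear(16, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 1))
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
    model = DDP(model, device_ids=[local_rank])

    # 3. DistributedSampler gives each process its own shard of the (toy) dataset.
    dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # 6. Scale the learning rate linearly with world size.
    optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr(0.01, world_size))
    loss_fn = nn.MSELoss()
    accum_steps = 4  # 7. micro-batches accumulated per optimizer step (assumed value)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        optimizer.zero_grad()     # drop leftover gradients from an incomplete window
        for step, (x, y) in enumerate(loader):
            x, y = x.to(device), y.to(device)
            sync = (step + 1) % accum_steps == 0
            # Skip the gradient all-reduce on intermediate micro-batches.
            ctx = contextlib.nullcontext() if sync else model.no_sync()
            with ctx:
                loss = loss_fn(model(x), y) / accum_steps
                loss.backward()
            if sync:
                optimizer.step()
                optimizer.zero_grad()

    # 5. Only rank 0 writes the checkpoint; the other ranks wait at the barrier.
    if rank == 0:
        torch.save(model.module.state_dict(), "checkpoint.pt")
    dist.barrier()
    dist.destroy_process_group()


# torchrun sets LOCAL_RANK, so training starts only under a distributed launch.
if "LOCAL_RANK" in os.environ:
    main()
```

Saved as, say, `train_ddp.py` (a hypothetical filename), this could be launched on a single node with `torchrun --nproc_per_node=4 train_ddp.py`. Linear learning-rate scaling is a heuristic; it is often paired with a warmup period, which is omitted here for brevity.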