distributed-training
Install: claude install-skill thada2402/AutoResearchClaw
## Distributed Training Best Practices
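1. Use DistributedDataParallel (DDP) instead of DataParallel for multi-GPU training; DDP runs one process per GPU and all-reduces gradients during the backward pass
2. Initialize the process group before building the model: dist.init_process_group(backend='nccl') (use 'gloo' for CPU-only runs)
3. Use DistributedSampler so each rank reads a distinct shard of the data, and call sampler.set_epoch(epoch) every epoch so shuffling differs between epochs
4. Synchronize batch norm statistics across ranks: nn.SyncBatchNorm.convert_sync_batchnorm(model), applied before wrapping the model in DDP
5. Save checkpoints on rank 0 only, so ranks do not write the same file concurrently
6. Scale the learning rate linearly with world size when the per-GPU batch size stays fixed (the global batch grows with world size); this is a heuristic and is usually paired with warmup
7. Use gradient accumulation to reach a larger effective batch size: effective batch = per-GPU batch × accumulation steps × world size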
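
The checklist above can be sketched as a single worker function. This is a minimal illustration under stated assumptions, not a reference implementation: `run_worker`, its parameters, the toy model, the random dataset, and the `ckpt.pt` path are all hypothetical names chosen for the example. It falls back to the `gloo` backend on CPU and only converts to SyncBatchNorm when CUDA is in use, since synchronized batch norm expects GPU tensors.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def run_worker(rank, world_size, init_method="env://", ckpt_path="ckpt.pt",
               base_lr=0.01, accum_steps=2, epochs=2):
    """Hypothetical per-process training entry point (one process per GPU)."""
    use_cuda = torch.cuda.is_available()

    # Item 2: initialize the process group (NCCL on GPU, Gloo on CPU).
    dist.init_process_group("nccl" if use_cuda else "gloo",
                            init_method=init_method,
                            rank=rank, world_size=world_size)
    if use_cuda:
        device = torch.device("cuda", rank % torch.cuda.device_count())
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")

    # Toy model, assumed for illustration only.
    model = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16),
                          nn.ReLU(), nn.Linear(16, 1))
    # Item 4: convert BatchNorm layers before wrapping in DDP (GPU only here).
    if use_cuda:
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.to(device)

    # Item 1: wrap in DDP; rank 0's parameters are broadcast to all ranks.
    ddp_model = DDP(model, device_ids=[device.index] if use_cuda else None)

    # Item 6: linear learning-rate scaling with world size (a heuristic).
    lr = base_lr * world_size
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=lr)

    # Item 3: same synthetic dataset on every rank, sharded by the sampler.
    torch.manual_seed(0)
    dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    sampler = DistributedSampler(dataset, num_replicas=world_size,
                                 rank=rank, shuffle=True)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)
    loss_fn = nn.MSELoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        optimizer.zero_grad()
        for step, (x, y) in enumerate(loader, start=1):
            # Item 7: divide the loss so accumulated gradients average out.
            loss = loss_fn(ddp_model(x.to(device)), y.to(device)) / accum_steps
            loss.backward()
            if step % accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()

    # Item 5: only rank 0 writes the checkpoint (unwrapped module weights).
    if rank == 0:
        torch.save(ddp_model.module.state_dict(), ckpt_path)
    dist.barrier()  # keep other ranks alive until the checkpoint is written
    dist.destroy_process_group()
    return lr
```

One way to launch this is `torch.multiprocessing.spawn(run_worker, args=(world_size,), nprocs=world_size)` with `MASTER_ADDR` and `MASTER_PORT` set in the environment. Under `torchrun`, the rank and world size could instead be read from the `RANK` and `WORLD_SIZE` environment variables. The sketch all-reduces gradients on every backward pass; wrapping the non-final accumulation steps in `ddp_model.no_sync()` would likely cut that communication, at the cost of a slightly more involved loop.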