ray-train

Solid

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

AI & Automation 9,182 stars 697 forks Updated 1 months ago MIT

Install

View on GitHub

Quality Score: 94/100

Stars 20%
100
Recency 20%
75
Frontmatter 20%
70
Documentation 15%
100
Issue Health 10%
50
License 10%
100
Description 5%
100

Skill Content

# Ray Train - Distributed Training Orchestration ## Quick start Ray Train scales machine learning training from single GPU to multi-node clusters with minimal code changes. **Installation**: ```bash pip install -U "ray[train]" ``` **Basic PyTorch training** (single node): ```python import ray from ray import train from ray.train import ScalingConfig from ray.train.torch import TorchTrainer import torch import torch.nn as nn # Define training function def train_func(config): # Your normal PyTorch code model = nn.Linear(10, 1) optimizer = torch.optim.SGD(model.parameters(), lr=0.01) # Prepare for distributed (Ray handles device placement) model = train.torch.prepare_model(model) for epoch in range(10): # Your training loop output = model(torch.randn(32, 10)) loss = output.sum() loss.backward() optimizer.step() optimizer.zero_grad() # Report metrics (logged automatically) train.report({"loss": loss.item(), "epoch": epoch}) # Run distributed training trainer = TorchTrainer( train_func, scaling_config=ScalingConfig( num_workers=4, # 4 GPUs/workers use_gpu=True ) ) result = trainer.fit() print(f"Final loss: {result.metrics['loss']}") ``` **That's it!** Ray handles: - Distributed coordination - GPU allocation - Fault tolerance - Checkpointing - Metric aggregation ## Common workflows ### Workflow 1: Scale existing PyTorch code **Original single-GPU code*...

Details

Author
Orchestra-Research
Repository
Orchestra-Research/AI-Research-SKILLs
Created
7 months ago
Last Updated
1 months ago
Language
TeX
License
MIT

Integrates with

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

ray-distributed-trainer

Distributed computing skill using Ray for parallel training, hyperparameter search, and resource management.

1,160 Updated today
a5c-ai
Data & Documents Solid

ray-data

Scalable data processing for ML workloads. Streaming execution across CPU/GPU, supports Parquet/CSV/JSON/images. Integrates with Ray Train, PyTorch, TensorFlow. Scales from single machine to 100s of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.

1,436 Updated 6 days ago
OpenRaiser
Data & Documents Solid

ray-data

Scalable data processing for ML workloads. Streaming execution across CPU/GPU, supports Parquet/CSV/JSON/images. Integrates with Ray Train, PyTorch, TensorFlow. Scales from single machine to 100s of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.

9,182 Updated 1 months ago
Orchestra-Research
AI & Automation Solid

huggingface-accelerate

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

175,435 Updated today
NousResearch
AI & Automation Featured

huggingface-accelerate

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

27,705 Updated today
davila7