ray-train

Solid

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

AI & Automation · 9,182 stars · 697 forks · Updated 1 month ago · MIT

Quality Score: 94/100

Stars 20%: 100
Recency 20%
Frontmatter 20%
Documentation 15%: 100
Issue Health 10%
License 10%: 100
Description 5%: 100

Skill Content

# Ray Train - Distributed Training Orchestration

## Quick start

Ray Train scales machine learning training from a single GPU to multi-node clusters with minimal code changes.

**Installation**:

```bash
pip install -U "ray[train]"
```

**Basic PyTorch training** (single node):

```python
import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
import torch
import torch.nn as nn

# Define training function (runs once on every worker)
def train_func(config):
    # Your normal PyTorch code
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Prepare for distributed (Ray moves the model to the worker's device and wraps it in DDP)
    model = train.torch.prepare_model(model)
    # Inputs must live on the same device as the prepared model
    device = train.torch.get_device()

    for epoch in range(10):
        # Your training loop
        output = model(torch.randn(32, 10, device=device))
        loss = output.sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Report metrics (logged automatically)
        train.report({"loss": loss.item(), "epoch": epoch})

# Run distributed training
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=4,  # 4 workers, one GPU each
        use_gpu=True
    )
)

result = trainer.fit()
print(f"Final loss: {result.metrics['loss']}")
```

**That's it!** Ray handles:

- Distributed coordination
- GPU allocation
- Fault tolerance
- Checkpointing
- Metric aggregation

## Common workflows

### Workflow 1: Scale existing PyTorch code

**Original single-GPU code*...

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Integrates with

Hugging Face · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

ray-distributed-trainer

Distributed computing skill using Ray for parallel training, hyperparameter search, and resource management.

1,160 Updated today

a5c-ai

Data & Documents Solid

ray-data

Scalable data processing for ML workloads. Streaming execution across CPU/GPU, supports Parquet/CSV/JSON/images. Integrates with Ray Train, PyTorch, TensorFlow. Scales from single machine to 100s of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.

1,436 Updated 6 days ago

OpenRaiser

Data & Documents Solid

ray-data

9,182 Updated 1 month ago

Orchestra-Research

AI & Automation Solid

huggingface-accelerate

Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard.

175,435 Updated today

NousResearch

AI & Automation Featured

huggingface-accelerate

27,705 Updated today

davila7