distributed-llm-pretraining-torchtitan

Solid

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

AI & Automation 175,435 stars 29,875 forks Updated today MIT

Quality Score: 96/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# TorchTitan - PyTorch Native Distributed LLM Pretraining

## Quick start

TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.

**Installation**:

```bash
# From PyPI (stable)
pip install torchtitan

# From source (latest features, requires PyTorch nightly)
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
```

**Download tokenizer**:

```bash
# Get HF token from https://huggingface.co/settings/tokens
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
```

**Start training on 8 GPUs**:

```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
```

## Common workflows

### Workflow 1: Pretrain Llama 3.1 8B on single node

Copy this checklist:

```
Single Node Pretraining:
- [ ] Step 1: Download tokenizer
- [ ] Step 2: Configure training
- [ ] Step 3: Launch training
- [ ] Step 4: Monitor and checkpoint
```

**Step 1: Download tokenizer**

```bash
python scripts/download_hf_assets.py \
  --repo_id meta-llama/Llama-3.1-8B \
  --assets tokenizer \
  --hf_token=YOUR_HF_TOKEN
```

**Step 2: Configure training**

Edit or create a TOML config file:

```toml
# llama3_8b_custom.toml
[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B training"

[model]
name = "llama3"
flavor = "8B"
hf_assets_path = "./assets/hf/Llama-3.1-8B"
```

...

Details

Author: NousResearch
Repository: NousResearch/hermes-agent
Created: 10 months ago
Last Updated: today
Language: Python
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

distributed-llm-pretraining-torchtitan

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

27,705 Updated today
davila7
AI & Automation Solid

distributed-llm-pretraining-torchtitan

Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.

9,182 Updated 1 month ago
Orchestra-Research
AI & Automation Featured

training-llms-megatron

Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use when training models larger than 1B parameters, when you need maximum GPU efficiency (47% MFU on H100), or when you require tensor/pipeline/sequence/context/expert parallelism. A production-ready framework used for Nemotron, LLaMA, and DeepSeek.

27,705 Updated today
davila7
AI & Automation Solid

training-llms-megatron

Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use when training models larger than 1B parameters, when you need maximum GPU efficiency (47% MFU on H100), or when you require tensor/pipeline/sequence/context/expert parallelism. A production-ready framework used for Nemotron, LLaMA, and DeepSeek.

9,182 Updated 1 month ago
Orchestra-Research
AI & Automation Featured

tensorrt-llm

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

27,705 Updated today
davila7