moe-training

Solid

Train Mixture of Experts (MoE) models using DeepSpeed or HuggingFace. Use when training large-scale models with limited compute (5× cost reduction vs dense models), implementing sparse architectures like Mixtral 8x7B or DeepSeek-V3, or scaling model capacity without proportional compute increase. Covers MoE architectures, routing mechanisms, load balancing, expert parallelism, and inference optimization.

AI & Automation 9,609 stars 724 forks Updated 1 months ago MIT

Install

View on GitHub

Quality Score: 94/100

Stars 20%
100
Recency 20%
75
Frontmatter 20%
70
Documentation 15%
100
Issue Health 10%
50
License 10%
100
Description 5%
100

Skill Content

# MoE Training: Mixture of Experts ## When to Use This Skill Use MoE Training when you need to: - **Train larger models** with limited compute (5× cost reduction vs dense models) - **Scale model capacity** without proportional compute increase - **Achieve better performance** per compute budget than dense models - **Specialize experts** for different domains/tasks/languages - **Reduce inference latency** with sparse activation (only 13B/47B params active in Mixtral) - **Implement SOTA models** like Mixtral 8x7B, DeepSeek-V3, Switch Transformers **Notable MoE Models**: Mixtral 8x7B (Mistral AI), DeepSeek-V3, Switch Transformers (Google), GLaM (Google), NLLB-MoE (Meta) ## Installation ```bash # DeepSpeed with MoE support pip install deepspeed>=0.6.0 # Megatron-DeepSpeed for large-scale training git clone https://github.com/microsoft/Megatron-DeepSpeed cd Megatron-DeepSpeed pip install -r requirements.txt # Alternative: HuggingFace Transformers pip install transformers accelerate ``` ## Quick Start ### Basic MoE Architecture ```python import torch import torch.nn as nn class MoELayer(nn.Module): """Sparse Mixture of Experts layer.""" def __init__(self, hidden_size, num_experts=8, top_k=2): super().__init__() self.num_experts = num_experts self.top_k = top_k # Expert networks (FFN) self.experts = nn.ModuleList([ nn.Sequential( nn.Linear(hidden_size, 4 * hidden_size), nn.GELU()...

Details

Author
Orchestra-Research
Repository
Orchestra-Research/AI-Research-SKILLs
Created
7 months ago
Last Updated
1 months ago
Language
TeX
License
MIT

Integrates with

Similar Skills

Semantically similar based on skill content — not just same category