moe-training

Solid

Train Mixture of Experts (MoE) models using DeepSpeed or HuggingFace. Use when training large-scale models with limited compute (5× cost reduction vs dense models), implementing sparse architectures like Mixtral 8x7B or DeepSeek-V3, or scaling model capacity without proportional compute increase. Covers MoE architectures, routing mechanisms, load balancing, expert parallelism, and inference optimization.

AI & Automation · 9,609 stars · 724 forks · Updated 1 month ago · MIT


Quality Score: 94/100 (component weights: Stars 20%, Recency 20%, Frontmatter 20%, Documentation 15%, Issue Health 10%, License 10%, Description 5%)

Skill Content

# MoE Training: Mixture of Experts

## When to Use This Skill

Use MoE Training when you need to:

- **Train larger models** with limited compute (5× cost reduction vs dense models)
- **Scale model capacity** without proportional compute increase
- **Achieve better performance** per compute budget than dense models
- **Specialize experts** for different domains/tasks/languages
- **Reduce inference latency** with sparse activation (only about 13B of 47B parameters are active per token in Mixtral 8x7B)
- **Implement SOTA models** like Mixtral 8x7B, DeepSeek-V3, Switch Transformers

**Notable MoE Models**: Mixtral 8x7B (Mistral AI), DeepSeek-V3, Switch Transformers (Google), GLaM (Google), NLLB-MoE (Meta)

## Installation

```bash
# DeepSpeed with MoE support
# (the specifier is quoted so the shell does not treat ">" as a redirect)
pip install "deepspeed>=0.6.0"

# Megatron-DeepSpeed for large-scale training
git clone https://github.com/microsoft/Megatron-DeepSpeed
cd Megatron-DeepSpeed
pip install -r requirements.txt

# Alternative: HuggingFace Transformers
pip install transformers accelerate
```

## Quick Start

### Basic MoE Architecture

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Sparse Mixture of Experts layer."""

    def __init__(self, hidden_size, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        # Expert networks (FFN)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU()...
```
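The excerpt above is cut off before any routing logic appears. As a rough, framework-free illustration of two ideas the description names (top-k routing and load balancing), the sketch below uses only the Python standard library. The function names `softmax`, `top_k_route`, and `load_balancing_loss` are hypothetical and are not taken from the skill, DeepSpeed, or HuggingFace. The auxiliary loss assumes the general Switch Transformers form (number of experts times the sum over experts of assignment fraction times mean router probability); the skill itself may use a different formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, top_k=2):
    """Select the top_k experts for a single token (hypothetical helper).

    Returns (expert_indices, gate_weights). The gate weights are the softmax
    probabilities of the chosen experts, renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:top_k]
    norm = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / norm for i in chosen]


def load_balancing_loss(batch_logits, top_k=2):
    """Auxiliary loss of the assumed form N * sum_i(f_i * P_i).

    f_i: fraction of routing assignments that went to expert i
    P_i: mean router probability given to expert i across the batch
    """
    num_tokens = len(batch_logits)
    num_experts = len(batch_logits[0])
    counts = [0] * num_experts
    prob_sums = [0.0] * num_experts

    for logits in batch_logits:
        chosen, _ = top_k_route(logits, top_k)
        for i in chosen:
            counts[i] += 1
        for i, p in enumerate(softmax(logits)):
            prob_sums[i] += p

    f = [c / (num_tokens * top_k) for c in counts]
    p_mean = [s / num_tokens for s in prob_sums]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p_mean))


# Illustrative inputs only: 4 experts, each token prefers a different expert.
balanced = [[10.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
# Every token prefers expert 0 (routing collapse).
collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]

print(top_k_route([2.0, 1.0, 0.0, -1.0], top_k=2))
print(load_balancing_loss(balanced, top_k=1))
print(load_balancing_loss(collapsed, top_k=1))
```

With top-1 routing, the evenly spread batch gives a loss of about 1, while the collapsed batch gives a value close to the number of experts. That gap is why adding a term of this shape to the training loss pushes the router toward spreading tokens across experts.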

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Integrates with

Hugging Face · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

moe-training

27,984 Updated today

davila7

AI & Automation Solid

neural-training

Neural pattern training with SONA (Self-Optimizing Neural Architecture), MoE (Mixture of Experts), and EWC++ for knowledge consolidation. Use when: pattern learning, model optimization, knowledge transfer, adaptive routing. Skip when: simple tasks, no learning required, one-off operations.

59,062 Updated today

ruvnet

AI & Automation Featured

training-llms-megatron

Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use when training models with more than 1B parameters, when you need maximum GPU efficiency (47% MFU on H100), or when you require tensor/pipeline/sequence/context/expert parallelism. Production-ready framework used for Nemotron, LLaMA, DeepSeek.

27,984 Updated today

davila7