Install: claude install-skill air-gapped/skills
# vLLM performance tuning
Target: operators deploying models on new hardware, chasing throughput / latency / goodput SLOs, or diagnosing perf regressions. Current through v0.21.0 stable (2026-05-15); v0.20.x stable since 2026-04-27. Last freshened 2026-05-28.
Companion skills: `vllm-benchmarking` (measure), `vllm-caching` (KV), `vllm-nvidia-hardware` (GPU/GEMM), `vllm-configuration` (env vars), `vllm-observability` (metrics).
## Tuning levers (apply by goal, not in fixed order)
**Always first: characterize the workload.** Record input and output sequence lengths (ISL / OSL), request rate (req/s), concurrency, and the SLO targets (P95 TTFT, P95 TPOT, P95 ITL). "Goodput" = tok/s/GPU **under SLO**, not raw tok/s. Everything below is keyed off these numbers.
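The difference between goodput and raw throughput can be sketched as below. This is a minimal illustration, assuming a per-request reading of "under SLO" (a request counts only if it meets both its TTFT and TPOT targets); `RequestResult` and both function names are hypothetical, not part of vLLM or its benchmark scripts.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    """Per-request measurements (hypothetical record, not a vLLM type)."""
    ttft_s: float        # time to first token, in seconds
    tpot_s: float        # mean time per output token, in seconds
    output_tokens: int   # number of tokens generated for this request


def raw_tok_per_s_per_gpu(results, duration_s, num_gpus):
    """Output tokens per second per GPU, ignoring SLOs."""
    return sum(r.output_tokens for r in results) / duration_s / num_gpus


def goodput_tok_per_s_per_gpu(results, duration_s, num_gpus,
                              ttft_slo_s, tpot_slo_s):
    """Output tokens per second per GPU, counting only requests
    that met both the TTFT and the TPOT target."""
    good_tokens = sum(
        r.output_tokens
        for r in results
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return good_tokens / duration_s / num_gpus


# Three requests over a 10 s window on 2 GPUs; only the first meets both SLOs.
results = [
    RequestResult(ttft_s=0.2, tpot_s=0.03, output_tokens=100),
    RequestResult(ttft_s=1.5, tpot_s=0.03, output_tokens=100),  # TTFT too slow
    RequestResult(ttft_s=0.3, tpot_s=0.08, output_tokens=200),  # TPOT too slow
]
raw = raw_tok_per_s_per_gpu(results, duration_s=10, num_gpus=2)
good = goodput_tok_per_s_per_gpu(results, duration_s=10, num_gpus=2,
                                 ttft_slo_s=0.5, tpot_slo_s=0.05)
print(raw, good)
```

In this toy run the raw figure is four times the goodput figure, which is why tuning decisions keyed to raw tok/s can point the wrong way.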
**Parallelism + MoE kernels (biggest single wins):**
- **Pick parallelism** (see `references/moe-and-ep.md`): model fits on 1 GPU → TP=1 + replicas (DP); MoE with MLA (DeepSeek/Kimi-K2) → DP-attn + EP; multi-node → TP intra-node + PP inter-node, or Wide-EP.
- **MoE on a new SKU → run `benchmark_moe.py --tune`** to generate `E=*,N=*,device_name=*.json` configs. Without tuned configs vLLM logs "Using default MoE config. Performance might be sub-optimal!", which typically means a 20-40% throughput loss.
- **Wide-EP** (`--enable-expert-parallel --enable-eplb --enable-dbo`) for DeepSeek/Qwen3/Kimi-K2 at ≥16 GPUs.
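The parallelism rules in this list can be read as a small decision table. The sketch below is only one possible encoding: the function name, its parameters, the order in which the rules are checked, and the single-node TP fallback are all assumptions, and `references/moe-and-ep.md` remains the authority for real deployments.

```python
def suggest_parallelism(fits_one_gpu: bool, is_moe: bool,
                        num_nodes: int, total_gpus: int) -> str:
    """Map workload facts to a starting-point parallelism layout.

    Hypothetical helper: rule order and the final fallback are assumptions.
    """
    if fits_one_gpu:
        # Replicate the model rather than shard it.
        return "TP=1 + replicas (DP)"
    if is_moe:
        if total_gpus >= 16:
            # Wide-EP threshold taken from the list above.
            return ("Wide-EP: --enable-expert-parallel "
                    "--enable-eplb --enable-dbo")
        return "DP-attn + EP"
    if num_nodes > 1:
        return "TP intra-node + PP inter-node"
    # Assumed fallback: a dense model sharded across one node's GPUs.
    return "TP across the GPUs of one node"


print(suggest_parallelism(fits_one_gpu=True, is_moe=False,
                          num_nodes=1, total_gpus=8))
print(suggest_parallelism(fits_one_gpu=False, is_moe=True,
                          num_nodes=2, total_gpus=16))
```

Treat the returned string as a first configuration to benchmark, not as a final answer; the goodput numbers from the workload characterization decide which layout wins.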
**Throughput / batching:**
- **`auto_tune.sh`** (`benchmarks/auto_tune/`) sweeps `max_num_seqs × max_num_batched_tokens`.
- **`--gpu-memory-utilization`**: raise from 0.90 toward 0.95 in small steps while load tests stay clear of OOM; once the remaining margin gets thin, back off.
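To make the shape of a `max_num_seqs × max_num_batched_tokens` sweep concrete, here is a minimal sketch that only builds the candidate `vllm serve` command lines. It does not reproduce `auto_tune.sh`; the function name, the model name, and the candidate values are placeholders, and skipping pairs where the token budget is below the sequence cap is an assumption about which combinations are worth running.

```python
import itertools
import shlex


def sweep_commands(model, max_num_seqs_values, max_num_batched_tokens_values,
                   gpu_memory_utilization=0.90):
    """Return one `vllm serve` command line per grid point (hypothetical helper)."""
    commands = []
    grid = itertools.product(max_num_seqs_values, max_num_batched_tokens_values)
    for seqs, tokens in grid:
        if tokens < seqs:
            # Assumption: a token budget smaller than the sequence cap
            # is not a useful point to benchmark.
            continue
        args = [
            "vllm", "serve", model,
            "--max-num-seqs", str(seqs),
            "--max-num-batched-tokens", str(tokens),
            "--gpu-memory-utilization", f"{gpu_memory_utilization:.2f}",
        ]
        commands.append(shlex.join(args))
    return commands


# Placeholder model name and grid values, for illustration only.
cmds = sweep_commands("my-org/my-model", [64, 128], [2048, 4096])
for cmd in cmds:
    print(cmd)
```

Each command would then be benchmarked against the SLOs from the workload characterization, keeping the grid point with the best goodput.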