deploy-kimi-k26-on-rtx-pro-6000listed
Install: claude install-skill soulmachine/skills
# Deploy Kimi-K2.6 (INT4 QAT or NVFP4) on 8× RTX PRO 6000 Blackwell Server Edition (sm_120)
Serve **Kimi-K2.6** (1T MoE; MLA; 256K; MoonViT vision) in a **user-chosen quantization**, with an
**official-image Docker** container — OpenAI-compatible API on `:30000`, **TP=8**, weights bind-mounted
read-only from local NVMe, all in VRAM. Both the **quantization** and the **engine** are chosen at
deploy time (steps 2–3) with a hardware-based recommendation:
- **INT4 QAT** — [moonshotai/Kimi-K2.6](https://huggingface.co/moonshotai/Kimi-K2.6); compressed-tensors → **Marlin** (auto); **vLLM or SGLang**; stock images, no patch. **Recommended on sm_120** (official, simplest, both engines verified).
- **NVFP4** — [nvidia/Kimi-K2.6-NVFP4](https://huggingface.co/nvidia/Kimi-K2.6-NVFP4); ModelOpt FP4 (`--quantization modelopt_fp4`); **vLLM only** (SGLang NVFP4 = NaN on sm_120). Needs a **patched CUDA-13 image** (`build_nvfp4_image.sh`) + **offline remote-code prep** (`prep_remote_code.sh`). On sm_120 it gives **no throughput win** (PCIe-comm-bound: Marlin ≈ native b12x, native is actually ~12% slower) — prefer it on **datacenter Blackwell (sm_100/B200)** where native FP4 (cutedsl) is tuned, or when you specifically need the NVFP4 checkpoint.
**Hardware target:** 8× **RTX PRO 6000 Blackwell Server Edition** (GB202, 96 GB, sm_120) — ~595 GB of
weights + KV cache need the full 8×96 GB pool; PCIe-only, no NVLink. Same-chip Workstation/Max-Q
variants should behave identically (unverified). The