deploy-kimi-k26-on-rtx-pro-6000

Deploy and serve Moonshot Kimi-K2.6 (1T MoE, MLA, 256K context, vision) in a user-chosen quantization — official INT4 QAT (moonshotai/Kimi-K2.6, compressed-tensors→Marlin; vLLM or SGLang) or NVFP4 (nvidia/Kimi-K2.6-NVFP4, ModelOpt FP4; vLLM only — SGLang NVFP4 is NaN-broken on sm_120) — on a Linux server (verified Ubuntu 26.04) with 8× NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB, sm_120) GPUs. The quantization and the engine are both chosen at deploy time with a hardware-based recommendation. Runs an official-image Docker container via nvidia-container-toolkit CDI (--device nvidia.com/gpu=all --ipc=host --network host, bind-mounted weights), exposing an OpenAI-compatible API on :30000 behind one static systemd service `kimi-k26` (quant + engine selected via its EnvironmentFile — only one 595 GB variant fits the 8-GPU pool at a time). Use when deploying or serving Kimi-K2.6 INT4 or NVFP4 on RTX PRO 6000 Blackwell / sm_120 hardware (vLLM-in-Docker, or SGLang-in-Docker for INT4) — or troubleshooting NCCL
soulmachine/skills · ★ 2 · DevOps & Infrastructure · score 75
Install: claude install-skill soulmachine/skills
# Deploy Kimi-K2.6 (INT4 QAT or NVFP4) on 8× RTX PRO 6000 Blackwell Server Edition (sm_120)

Serve **Kimi-K2.6** (1T MoE; MLA; 256K; MoonViT vision) in a **user-chosen quantization**, with an **official-image Docker** container: OpenAI-compatible API on `:30000`, **TP=8**, weights bind-mounted read-only from local NVMe, all in VRAM. Both the **quantization** and the **engine** are chosen at deploy time (steps 2–3) with a hardware-based recommendation:

- **INT4 QAT**: [moonshotai/Kimi-K2.6](https://huggingface.co/moonshotai/Kimi-K2.6); compressed-tensors → **Marlin** (auto); **vLLM or SGLang**; stock images, no patch. **Recommended on sm_120** (official, simplest, both engines verified).
- **NVFP4**: [nvidia/Kimi-K2.6-NVFP4](https://huggingface.co/nvidia/Kimi-K2.6-NVFP4); ModelOpt FP4 (`--quantization modelopt_fp4`); **vLLM only** (SGLang NVFP4 = NaN on sm_120). Needs a **patched CUDA-13 image** (`build_nvfp4_image.sh`) + **offline remote-code prep** (`prep_remote_code.sh`). On sm_120 it gives **no throughput win** (PCIe-comm-bound: Marlin ≈ native b12x, native is actually ~12% slower); prefer it on **datacenter Blackwell (sm_100/B200)** where native FP4 (cutedsl) is tuned, or when you specifically need the NVFP4 checkpoint.

**Hardware target:** 8× **RTX PRO 6000 Blackwell Server Edition** (GB202, 96 GB, sm_120). The ~595 GB of weights + KV cache need the full 8×96 GB pool; PCIe-only, no NVLink. Same-chip Workstation/Max-Q variants should behave identically (unverified). The