deploy-kimi-k26-on-rtx-pro-6000listed

Deploy and serve Moonshot Kimi-K2.6 (1T MoE, MLA, 256K context, vision) in a user-chosen quantization — official INT4 QAT (moonshotai/Kimi-K2.6, compressed-tensors→Marlin; vLLM or SGLang) or NVFP4 (nvidia/Kimi-K2.6-NVFP4, ModelOpt FP4; vLLM only — SGLang NVFP4 is NaN-broken on sm_120) — on a Linux server (verified Ubuntu 26.04) with 8× NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB, sm_120) GPUs. The quantization and the engine are both chosen at deploy time with a hardware-based recommendation. Runs an official-image Docker container via nvidia-container-toolkit CDI (--device nvidia.com/gpu=all --ipc=host --network host, bind-mounted weights; --network host is required for NCCL's GPU-to-GPU transport / IB-RoCE GPUDirect RDMA, and the server binds 0.0.0.0 with off-box clients reaching it through an authenticated Caddy proxy that upstreams over loopback 127.0.0.1 — structural, no firewall), exposing an OpenAI-compatible API on :30000 behind one static systemd service `kimi-k26` (quant + engine selected vi
soulmachine/skills · ★ 3 · DevOps & Infrastructure · score 74

Install: claude install-skill soulmachine/skills

# Deploy Kimi-K2.6 (INT4 QAT or NVFP4) on 8× RTX PRO 6000 Blackwell Server Edition (sm_120) Serve **Kimi-K2.6** (1T MoE; MLA; 256K; MoonViT vision) in a **user-chosen quantization**, with an **official-image Docker** container — OpenAI-compatible API on `:30000`, **TP=8**, weights bind-mounted read-only from local NVMe, all in VRAM. Both the **quantization** and the **engine** are chosen at deploy time (steps 2–3) with a hardware-based recommendation: - **INT4 QAT** — [moonshotai/Kimi-K2.6](https://huggingface.co/moonshotai/Kimi-K2.6); compressed-tensors → **Marlin** (auto); **vLLM or SGLang**; stock images, no patch. **Recommended on sm_120** (official, simplest, both engines verified). - **NVFP4** — [nvidia/Kimi-K2.6-NVFP4](https://huggingface.co/nvidia/Kimi-K2.6-NVFP4); ModelOpt FP4 (`--quantization modelopt_fp4`); **vLLM only** (SGLang NVFP4 = NaN on sm_120). Needs a **patched CUDA-13 image** (`build_nvfp4_image.sh`) + **offline remote-code prep** (`prep_remote_code.sh`). On sm_120 it gives **no throughput win** (PCIe-comm-bound: Marlin ≈ native b12x, native is actually ~12% slower) — prefer it on **datacenter Blackwell (sm_100/B200)** where native FP4 (cutedsl) is tuned, or when you specifically need the NVFP4 checkpoint. **Hardware target:** 8× **RTX PRO 6000 Blackwell Server Edition** (GB202, 96 GB, sm_120) — ~595 GB of weights + KV cache need the full 8×96 GB pool; PCIe-only, no NVLink. Same-chip Workstation/Max-Q variants should behave identically (unverified). The