serving-llms-vllm

vLLM: high-throughput LLM serving, OpenAI API, quantization.
aashutosh396/mindpalace · ★ 0 · AI & Automation · score 78

Install: claude install-skill aashutosh396/mindpalace

# vLLM - High-Performance LLM Serving

## When to use

Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

## Quick start

vLLM achieves up to 24x higher throughput than HuggingFace Transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).

**Installation**:

```bash
pip install vllm
```

**Basic offline inference**:

```python
from vllm import LLM, SamplingParams

# Load the model once; weights are downloaded from the Hugging Face Hub on first use
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

# generate() takes a list of prompts and returns one RequestOutput per prompt
outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
```

**OpenAI-compatible server**:

```bash
# Start the server (listens on port 8000 by default)
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

# Query with OpenAI SDK (run in a second terminal once the server is up)
python - <<'EOF'
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Meta-Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
EOF
```

## Common workflows

### Workflow 1: Production API deployment

Copy this checklist and track progress:

```
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics
```
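
For Steps 2 and 5 of the checklist, a small stdlib-only helper can turn a limited-traffic run into latency and throughput numbers. This is a minimal sketch, not part of vLLM: `timed_call` and `summarize` are hypothetical names, and the throughput figure assumes requests are sent one at a time rather than concurrently.

```python
import math
import statistics
import time
from typing import Callable, Dict, List, Tuple


def timed_call(send: Callable[[], int]) -> Tuple[float, int]:
    """Time one request. `send` (hypothetical) performs the request
    and returns the number of completion tokens it produced."""
    start = time.perf_counter()
    tokens = send()
    return time.perf_counter() - start, tokens


def summarize(samples: List[Tuple[float, int]]) -> Dict[str, float]:
    """Summarize (latency_seconds, completion_tokens) pairs."""
    if not samples:
        raise ValueError("need at least one sample")
    latencies = sorted(lat for lat, _ in samples)
    total_tokens = sum(tok for _, tok in samples)
    total_time = sum(latencies)
    # Nearest-rank 95th percentile over the sorted latencies
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "requests": len(latencies),
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
        # Valid only for sequential requests (assumption of this sketch)
        "sequential_tokens_per_s": (
            total_tokens / total_time if total_time > 0 else 0.0
        ),
    }
```

In practice `send` would typically wrap a `client.chat.completions.create(...)` call against the server from the Quick start and return `response.usage.completion_tokens`. Collect a few dozen samples with `timed_call`, pass them to `summarize`, and compare the result before and after the production rollout.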