serving-llms-vllm

Featured

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

AI & Automation · 27,705 stars · 2,858 forks · Updated today · MIT

Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# vLLM - High-Performance LLM Serving

## Quick start

vLLM achieves up to 24x higher throughput than Hugging Face Transformers through PagedAttention (block-based KV cache) and continuous batching (mixing prefill/decode requests).

**Installation**:

```bash
pip install vllm
```

**Basic offline inference**:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain quantum computing"], sampling)
print(outputs[0].outputs[0].text)
```

**OpenAI-compatible server**:

```bash
vllm serve meta-llama/Llama-3-8B-Instruct

# Query with OpenAI SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')
print(client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"
```

## Common workflows

### Workflow 1: Production API deployment

Copy this checklist and track progress:

```
Deployment Progress:
- [ ] Step 1: Configure server settings
- [ ] Step 2: Test with limited traffic
- [ ] Step 3: Enable monitoring
- [ ] Step 4: Deploy to production
- [ ] Step 5: Verify performance metrics
```

**Step 1: Configure server settings**

Choose configuration based on your model size:

```bash
# For 7B-13B models on single GPU
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  ...
```
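As a rough aid for Step 1, the sketch below estimates how many tokens of KV cache fit under a given `--gpu-memory-utilization`, and how many full-length sequences at `--max-model-len` that allows. It is a back-of-envelope calculation, not how vLLM itself profiles memory. The helper names are hypothetical; the 24 GiB GPU, 15 GiB of FP16 weights, and 2 GiB of activation/runtime overhead are assumed values; the layer and head counts are those of Llama 3 8B (32 layers, 8 KV heads, head dimension 128).

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_token_budget(gpu_mem_gib: float, gpu_memory_utilization: float,
                          weights_gib: float, overhead_gib: float,
                          per_token_bytes: int) -> int:
    """Tokens of KV cache that fit after weights and overhead (hypothetical helper)."""
    usable_gib = gpu_mem_gib * gpu_memory_utilization - weights_gib - overhead_gib
    return max(0, int(usable_gib * 1024**3 // per_token_bytes))


# Llama 3 8B shape, FP16 KV cache (2 bytes per element).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes = 128 KiB per token

# Assumed hardware and overhead figures; replace with your own measurements.
budget = kv_cache_token_budget(
    gpu_mem_gib=24,
    gpu_memory_utilization=0.9,  # same meaning as --gpu-memory-utilization
    weights_gib=15.0,            # assumption: ~8B parameters in FP16
    overhead_gib=2.0,            # assumption: activations and runtime overhead
    per_token_bytes=per_token,
)
max_model_len = 8192             # same value as --max-model-len above
print(budget, budget // max_model_len)
```

Under these assumptions a single 8192-token sequence needs about 1 GiB of KV cache. If the budget comes out too small for the expected concurrency, lowering `--max-model-len` or serving a quantized model are the usual levers.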

Details

Author: davila7
Repository: davila7/claude-code-templates
Created: 11 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

OpenAI · AI
Anthropic · AI
