tensorrt-llm

Featured

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

AI & Automation | 27,705 stars | 2,858 forks | Updated today | MIT

Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100
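
The breakdown above pairs each component score with a percentage weight. As a rough check, here is a minimal sketch that combines them, assuming a plain weighted average; the page does not say how the headline score is derived, and the `weighted_score` helper is hypothetical. Under that assumption the components combine to 89, not the displayed 99/100, so the site presumably uses a different formula or adjustments not shown here.

```python
# (weight in percent, score out of 100), copied from the breakdown above.
components = {
    "Stars": (20, 100),
    "Recency": (20, 100),
    "Frontmatter": (20, 70),
    "Documentation": (15, 100),
    "Issue Health": (10, 50),
    "License": (10, 100),
    "Description": (5, 100),
}


def weighted_score(parts):
    """Weighted average of the scores (an assumed formula, not the site's)."""
    total_weight = sum(weight for weight, _ in parts.values())
    weighted_sum = sum(weight * score for weight, score in parts.values())
    return weighted_sum / total_weight


score = weighted_score(components)
print(score)  # 89.0
```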

Skill Content

# TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

## When to use TensorRT-LLM

**Use TensorRT-LLM when:**
- Deploying on NVIDIA GPUs (A100, H100, GB200)
- Need maximum throughput (24,000+ tokens/sec on Llama 3)
- Require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes

**Use vLLM instead when:**
- Need simpler setup and Python-first API
- Want PagedAttention without TensorRT compilation
- Working with AMD GPUs or non-NVIDIA hardware

**Use llama.cpp instead when:**
- Deploying on CPU or Apple Silicon
- Need edge deployment without NVIDIA GPUs
- Want simpler GGUF quantization format

## Quick start

### Installation

```bash
# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
```

### Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each result holds a list of completions; print the first one
    print(output.outputs[0].text)
```

### Serving with trtllm-serve

```bash
# Start server (automatic model download and compilation)
trtllm-serve meta-llama/Meta-Ll...
```
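
The "When to use" guidance in the skill content above can be read as a small decision rule. The sketch below is one hedged way to encode it; the `Deployment` dataclass, its fields, and `pick_engine` are hypothetical names introduced for illustration and are not part of TensorRT-LLM, vLLM, or llama.cpp.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical description of a target environment."""
    hardware: str                     # "nvidia", "amd", "cpu", or "apple"
    wants_simple_setup: bool = False  # prefer a Python-first API with no engine compilation


def pick_engine(d: Deployment) -> str:
    """Map the skill's 'When to use' bullets to an engine name."""
    # CPU and Apple Silicon targets: the skill points to llama.cpp.
    if d.hardware in ("cpu", "apple"):
        return "llama.cpp"
    # Other non-NVIDIA hardware, or a preference for simpler setup: vLLM.
    if d.hardware != "nvidia" or d.wants_simple_setup:
        return "vLLM"
    # NVIDIA GPUs where throughput and latency matter most: TensorRT-LLM.
    return "TensorRT-LLM"


print(pick_engine(Deployment("nvidia")))
print(pick_engine(Deployment("amd")))
print(pick_engine(Deployment("apple")))
```

This deliberately ignores quantization format and multi-node scaling; those bullets would need extra fields if the rule were used for anything beyond illustration.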

Details

Author: davila7
Repository: davila7/claude-code-templates
Created: 11 months ago
Last Updated: today
Language: Python
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

tensorrt-llm

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

175,435 Updated today
NousResearch
AI & Automation Solid

tensorrt-llm

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

9,182 Updated 1 month ago
Orchestra-Research
AI & Automation Featured

llama-cpp

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

27,705 Updated today
davila7
AI & Automation Solid

llama-cpp

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

175,435 Updated today
NousResearch
AI & Automation Solid

llama-cpp

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

9,182 Updated 1 month ago
Orchestra-Research