llama-cpp

Solid

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
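To make the "reduced memory" claim concrete, here is a rough back-of-the-envelope sketch of how bits per weight translate into weight storage. It assumes a hypothetical 7B-parameter model and counts weights only. Real GGUF files are somewhat larger because many quantization types mix bit widths and store scaling metadata, and the KV cache and activations need memory on top. The function names are illustrative and not part of llama.cpp.

```python
def gguf_weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate bytes needed to store the weights alone."""
    return n_params * bits_per_weight / 8


def to_gib(n_bytes: float) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    n_params = 7e9  # assumption: a 7B-parameter model
    for bits in (16, 8, 4, 1.5):
        size = to_gib(gguf_weight_bytes(n_params, bits))
        print(f"{bits:>4} bits/weight: ~{size:.1f} GiB")
```

The estimate scales linearly, so halving the bit width halves the weight footprint. That is the main reason low-bit quantization lets larger models fit on CPU-only or memory-constrained machines.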

AI & Automation · 175,435 stars · 29,875 forks · Updated today · MIT

Quality Score: 96/100

Stars (weight 20%): 100
Recency (weight 20%): 100
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100
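The page does not say how the component scores are combined into the overall figure. As a sketch only, assuming a plain weighted sum of the listed components, the arithmetic would look like this. Under that assumption the result is 89, not the displayed 96, so the site presumably uses a different aggregation or additional inputs not shown here.

```python
# (weight in percent, component score) as listed in the breakdown
COMPONENTS = {
    "Stars": (20, 100),
    "Recency": (20, 100),
    "Frontmatter": (20, 70),
    "Documentation": (15, 100),
    "Issue Health": (10, 50),
    "License": (10, 100),
    "Description": (5, 100),
}


def weighted_score(components: dict) -> float:
    """Plain weighted sum; this aggregation rule is an assumption, not the site's documented method."""
    return sum(weight * score for weight, score in components.values()) / 100


if __name__ == "__main__":
    print(weighted_score(COMPONENTS))
```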

Skill Content

# llama.cpp

Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.

## When to use llama.cpp

**Use llama.cpp when:**
- Running on CPU-only machines
- Deploying on Apple Silicon (M1/M2/M3/M4)
- Using AMD or Intel GPUs (no CUDA)
- Edge deployment (Raspberry Pi, embedded systems)
- Need simple deployment without Docker/Python

**Use TensorRT-LLM instead when:**
- Have NVIDIA GPUs (A100/H100)
- Need maximum throughput (100K+ tok/s)
- Running in datacenter with CUDA

**Use vLLM instead when:**
- Have NVIDIA GPUs
- Need Python-first API
- Want PagedAttention

## Quick start

### Installation

```bash
# macOS/Linux
brew install llama.cpp

# Or build from source (recent releases build with CMake;
# the older `make LLAMA_*=1` targets no longer work)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Metal (Apple Silicon) is enabled by default on macOS

# With CUDA (NVIDIA)
cmake -B build -DGGML_CUDA=ON

# With ROCm (AMD)
cmake -B build -DGGML_HIP=ON
```

### Download model

```bash
# Download from HuggingFace (GGUF format)
huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/

# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
```

### Run inference

```bash
# Simple chat (binaries from a source build live in build/bin/;
# with the Homebrew install, call `llama-cli` directly)
./build/bin/llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -p "Explain quantum computing" \
  -n 256  # Max tokens

# Interactive chat
./build/bin/llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --interactive
```

### Server mode

```bash
# Start OpenAI-compatible serv...
```
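For scripting the single-prompt `llama-cli` call shown in the skill content, a minimal Python sketch follows. It only assembles the argument list from the flags that appear in the excerpt (`-m`, `-p`, `-n`). The helper name `build_llama_cli_cmd` and its parameters are hypothetical, and the binary location is an assumption that depends on how llama.cpp was installed.

```python
import shlex
from typing import List, Optional


def build_llama_cli_cmd(
    model_path: str,
    prompt: Optional[str] = None,
    max_tokens: Optional[int] = None,
    binary: str = "llama-cli",  # assumption: on PATH; pass a full path for a source build
) -> List[str]:
    """Assemble an argv list using only the flags shown in the excerpt."""
    cmd = [binary, "-m", model_path]
    if prompt is not None:
        cmd += ["-p", prompt]
    if max_tokens is not None:
        cmd += ["-n", str(max_tokens)]
    return cmd


if __name__ == "__main__":
    cmd = build_llama_cli_cmd(
        "models/llama-2-7b-chat.Q4_K_M.gguf",
        prompt="Explain quantum computing",
        max_tokens=256,
    )
    # Print the shell-quoted command; run it with subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Passing the command as a list rather than a single string avoids shell-quoting problems when the prompt contains spaces or quotes.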

Details

Author: NousResearch
Repository: NousResearch/hermes-agent
Created: 10 months ago
Last Updated: today
Language: Python
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

llama-cpp
AI & Automation · Featured · 27,705 stars · Updated today · davila7

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

llama-cpp
AI & Automation · Solid · 9,182 stars · Updated 1 month ago · Orchestra-Research

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

llama-cpp
AI & Automation · Listed · 33 stars · Updated 2 days ago · tdimino

Secondary local LLM inference engine via llama.cpp. Use this skill when running GGUF models directly, loading LoRA adapters for Kothar, benchmarking inference speed, or serving models via llama-server. Includes dedicated Qwen 3.5 serve scripts (9B dense with F16 option, 35B MoE) with asymmetric KV cache and thinking mode. Complements Ollama (which remains primary for RLAMA and general use).

tensorrt-llm
AI & Automation · Featured · 27,705 stars · Updated today · davila7

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

tensorrt-llm
AI & Automation · Solid · 175,435 stars · Updated today · NousResearch

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.