llama-cpp

Solid

Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.

AI & Automation 175,435 stars 29875 forks Updated today MIT

Install

View on GitHub

Quality Score: 96/100

Stars 20%

100

Recency 20%

100

Frontmatter 20%

Documentation 15%

100

Issue Health 10%

License 10%

100

Description 5%

100

Skill Content

# llama.cpp Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware. ## When to use llama.cpp **Use llama.cpp when:** - Running on CPU-only machines - Deploying on Apple Silicon (M1/M2/M3/M4) - Using AMD or Intel GPUs (no CUDA) - Edge deployment (Raspberry Pi, embedded systems) - Need simple deployment without Docker/Python **Use TensorRT-LLM instead when:** - Have NVIDIA GPUs (A100/H100) - Need maximum throughput (100K+ tok/s) - Running in datacenter with CUDA **Use vLLM instead when:** - Have NVIDIA GPUs - Need Python-first API - Want PagedAttention ## Quick start ### Installation ```bash # macOS/Linux brew install llama.cpp # Or build from source git clone https://github.com/ggerganov/llama.cpp cd llama.cpp make # With Metal (Apple Silicon) make LLAMA_METAL=1 # With CUDA (NVIDIA) make LLAMA_CUDA=1 # With ROCm (AMD) make LLAMA_HIP=1 ``` ### Download model ```bash # Download from HuggingFace (GGUF format) huggingface-cli download \ TheBloke/Llama-2-7B-Chat-GGUF \ llama-2-7b-chat.Q4_K_M.gguf \ --local-dir models/ # Or convert from HuggingFace python convert_hf_to_gguf.py models/llama-2-7b-chat/ ``` ### Run inference ```bash # Simple chat ./llama-cli \ -m models/llama-2-7b-chat.Q4_K_M.gguf \ -p "Explain quantum computing" \ -n 256 # Max tokens # Interactive chat ./llama-cli \ -m models/llama-2-7b-chat.Q4_K_M.gguf \ --interactive ``` ### Server mode ```bash # Start OpenAI-compatible serv...

Details

Author: NousResearch
Repository: NousResearch/hermes-agent
Created: 10 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

OpenAI · AI Anthropic · AI Hugging Face · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

llama-cpp

27,705 Updated today

davila7

AI & Automation Solid

llama-cpp

9,182 Updated 1 months ago

Orchestra-Research

AI & Automation Listed

llama-cpp

Secondary local LLM inference engine via llama.cpp. This skill should be used when running GGUF models directly, loading LoRA adapters for Kothar, benchmarking inference speed, or serving models via llama-server. Includes dedicated Qwen 3.5 serve scripts (9B dense with F16 option, 35B MoE) with asymmetric KV cache and thinking mode. Complements Ollama (which remains primary for RLAMA and general use).

33 Updated 2 days ago

tdimino

AI & Automation Featured

tensorrt-llm

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

27,705 Updated today

davila7

AI & Automation Solid

tensorrt-llm

175,435 Updated today

NousResearch