Install: claude install-skill tdimino/claude-code-minoan
# llama.cpp - Secondary Inference Engine
Direct access to llama.cpp for faster inference, LoRA adapter loading, and benchmarking on Apple Silicon. Ollama remains primary for RLAMA and general use; llama.cpp is the power tool.
## Prerequisites
```bash
brew install llama.cpp
```
Binaries: `llama-cli`, `llama-server`, `llama-embedding`, `llama-quantize`
## Quick Reference
### Resolve Ollama Model to GGUF Path
To avoid duplicating model files, resolve an Ollama model name to its GGUF blob path:
```bash
~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7b
```
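For readers curious how such a lookup can work, here is a minimal sketch. It is not the contents of `ollama_model_path.sh`. The function name `resolve_ollama_gguf` is hypothetical. The sketch assumes Ollama's default on-disk layout (manifests under `manifests/registry.ollama.ai/library/<name>/<tag>`, weights stored as `blobs/sha256-<hex>`) and a model pulled from the default registry.

```bash
# Hypothetical sketch: map an Ollama model reference (name[:tag]) to its GGUF blob.
# Assumes the default Ollama store layout; set OLLAMA_MODELS to point elsewhere.
resolve_ollama_gguf() {
  local ref="$1"
  local root="${OLLAMA_MODELS:-$HOME/.ollama/models}"
  local name="${ref%%:*}"
  local tag="latest"
  case "$ref" in *:*) tag="${ref#*:}" ;; esac

  local manifest="$root/manifests/registry.ollama.ai/library/$name/$tag"
  [ -f "$manifest" ] || { echo "manifest not found: $manifest" >&2; return 1; }

  # Put each JSON object on its own line, keep the layer whose mediaType marks
  # the model weights, and pull out its sha256 digest.
  local digest
  digest=$(tr -d '\n' < "$manifest" \
    | tr '{' '\n' \
    | grep '"application/vnd.ollama.image.model"' \
    | sed -n 's/.*"digest"[[:space:]]*:[[:space:]]*"\(sha256:[0-9a-f]*\)".*/\1/p' \
    | head -n 1)
  [ -n "$digest" ] || { echo "no model layer in $manifest" >&2; return 1; }

  # Blobs are stored with '-' in place of the ':' in the digest.
  echo "$root/blobs/${digest/:/-}"
}
```

Calling `resolve_ollama_gguf qwen2.5:7b` prints a path under `blobs/` that can be passed to `llama-cli -m`. The text-based JSON parsing keeps the sketch dependency-free; a JSON-aware tool would be more robust.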
### Run Inference
```bash
GGUF=$(~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7b)
llama-cli -m "$GGUF" -p "Your prompt here" -n 128 --n-gpu-layers all --single-turn --simple-io --no-display-prompt
```
### Start API Server
To start an OpenAI-compatible server on port 8081 (chosen to avoid clashing with Ollama's default port, 11434):
```bash
~/.claude/skills/llama-cpp/scripts/llama_serve.sh <model.gguf>
# Or with options:
PORT=8082 CTX=8192 ~/.claude/skills/llama-cpp/scripts/llama_serve.sh <model.gguf>
```
Test the server:
```bash
curl http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'
```
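Large models can take a while to load, so a request sent right after launch may fail. The sketch below polls the server's `/health` endpoint until it responds successfully. The function name `wait_for_llama_server` and its default timeout are illustrative assumptions, not part of the skill's scripts.

```bash
# Hypothetical helper: block until llama-server reports healthy, or give up.
# Usage: wait_for_llama_server [port] [timeout_seconds]
wait_for_llama_server() {
  local port="${1:-8081}"
  local timeout="${2:-60}"
  local waited=0

  # curl -f makes non-2xx responses (e.g. 503 while the model loads) count as failure.
  until curl -sf "http://localhost:${port}/health" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "llama-server on port ${port} not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

A typical pattern is `wait_for_llama_server 8081 120 && curl ...`, so the chat request only runs once the model has finished loading.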
### Serve Qwen3.5
Dedicated servers for Qwen3.5 models with asymmetric KV cache, jinja templates, and thinking mode.
**9B Dense (recommended for 24-36GB systems):**
```bash
# Default: Qwen3.5-9B, thinking mode, 32K context
~