speculative-decoding

Solid

Accelerate LLM inference using draft-model speculative decoding, Medusa (multiple decoding heads), and lookahead decoding. Use when optimizing inference speed (1.5-3.6× speedup), reducing latency for real-time applications, or deploying models with limited compute. Covers draft models, tree-based attention, Jacobi iteration, parallel token generation, and production deployment strategies.

AI & Automation · 9,182 stars · 697 forks · Updated 1 month ago · MIT


Quality Score: 94/100

Stars (weight 20%): 100
Recency (weight 20%): 75
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# Speculative Decoding: Accelerating LLM Inference

## When to Use This Skill

Use Speculative Decoding when you need to:

- **Speed up inference** by 1.5-3.6× without quality loss
- **Reduce latency** for real-time applications (chatbots, code generation)
- **Optimize throughput** for high-volume serving
- **Deploy efficiently** on limited hardware
- **Generate faster** without changing model architecture

**Key Techniques**: Draft model speculative decoding, Medusa (multiple heads), Lookahead Decoding (Jacobi iteration)

**Papers**: Medusa (arXiv 2401.10774), Lookahead Decoding (ICML 2024), Speculative Decoding Survey (ACL 2024)

## Installation

```bash
# Standard speculative decoding (transformers)
pip install transformers accelerate

# Medusa (multiple decoding heads)
git clone https://github.com/FasterDecoding/Medusa
cd Medusa
pip install -e .

# Lookahead Decoding
git clone https://github.com/hao-ai-lab/LookaheadDecoding
cd LookaheadDecoding
pip install -e .

# Optional: vLLM with speculative decoding
pip install vllm
```

## Quick Start

### Basic Speculative Decoding (Draft Model)

```python
import torch  # needed for torch.float16 below
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load target model (large, slow)
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    device_map="auto",
    torch_dtype=torch.float16
)

# Load draft model (small, fast)
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    t...
```
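The draft-and-verify idea behind draft model speculative decoding can be sketched without any model weights. The block below is a minimal illustration, not the skill's own code and not the `transformers` API: `toy_target`, `toy_draft`, `greedy_decode`, and `speculative_decode` are hypothetical names, the "models" are plain functions that map a token list to the next token, and verification uses greedy matching rather than the rejection-sampling rule from the papers. It shows why the output matches what the target model would produce on its own: every token kept is one the target would have chosen.

```python
from typing import Callable, List

# Hypothetical stand-in for a language model: token history -> next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> List[int]:
    """Greedy draft-and-verify loop (assumed simplification of the technique)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks each proposed position. A real system scores
        #    all k positions in a single batched forward pass; here the
        #    calls are sequential for clarity.
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the target's choice
            if expected != guess:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every guess matched, so the target yields one bonus token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:limit]


def toy_target(tokens: List[int]) -> int:
    """Toy deterministic 'large' model over an 11-token vocabulary."""
    return (3 * tokens[-1] + 1) % 11


def toy_draft(tokens: List[int]) -> int:
    """Toy 'small' model: agrees with the target except after multiples of 4."""
    guess = toy_target(tokens)
    return guess if tokens[-1] % 4 else (guess + 1) % 11
```

Calling `speculative_decode(toy_target, toy_draft, [2], 20)` returns the same sequence as `greedy_decode(toy_target, [2], 20)`. In this sketch a better draft only changes how many tokens are accepted per round, never the output, which is the property the "without quality loss" claim rests on.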

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

speculative-decoding

Accelerate LLM inference using speculative decoding, Medusa multiple heads, and lookahead decoding techniques. Use when optimizing inference speed (1.5-3.6× speedup), reducing latency for real-time applications, or deploying models with limited compute. Covers draft models, tree-based attention, Jacobi iteration, parallel token generation, and production deployment strategies.

27,705 Updated today
davila7

AI & Automation Listed

vllm-speculative-decoding

Pick, configure, tune, monitor vLLM speculative decoding in production. Eleven SpeculativeMethod options (ngram, ngram_gpu, medusa, mlp_speculator, draft_model, suffix, eagle, eagle3, dflash, mtp, extract_hidden_states), `--speculative-config` JSON schema, which methods pair with which target model family, Prometheus acceptance metric surface, version gates (v0.11.1 EAGLE-3 preamble fix, v0.16 parallel drafting, v0.18 ngram_gpu, v0.19 dflash and zero-bubble), composability with chunked prefill / PP / LoRA / FP8 / structured outputs, Arctic Inference plugin, where spec-dec stops paying at high batch.

3 Updated yesterday
air-gapped

Code & Development Listed

vocabtrim-speculative-decoding

Accelerate speculative decoding by pruning drafter vocabulary to high-frequency tokens. Achieves 16% speedup in memory-bound settings by eliminating unused vocabulary entries without retraining.

3 Updated 2 months ago
ADu2021

AI & Automation Featured

model-pruning

Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.

27,705 Updated today
davila7

AI & Automation Solid

model-pruning

Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.

9,182 Updated 1 month ago
Orchestra-Research