speculative-decoding

Solid

Accelerate LLM inference using speculative decoding, Medusa (multiple decoding heads), and lookahead decoding. Use when optimizing inference speed (1.5-3.6× speedup), reducing latency for real-time applications, or deploying models with limited compute. Covers draft models, tree-based attention, Jacobi iteration, parallel token generation, and production deployment strategies.

AI & Automation · 9,182 stars · 697 forks · Updated 1 month ago · MIT

Quality Score: 94/100

Stars 20%: 100
Recency 20%
Frontmatter 20%
Documentation 15%: 100
Issue Health 10%
License 10%: 100
Description 5%: 100

Skill Content

# Speculative Decoding: Accelerating LLM Inference

## When to Use This Skill

Use Speculative Decoding when you need to:

- **Speed up inference** by 1.5-3.6× without quality loss
- **Reduce latency** for real-time applications (chatbots, code generation)
- **Optimize throughput** for high-volume serving
- **Deploy efficiently** on limited hardware
- **Generate faster** without changing model architecture

**Key Techniques**: Draft model speculative decoding, Medusa (multiple heads), Lookahead Decoding (Jacobi iteration)

**Papers**: Medusa (arXiv 2401.10774), Lookahead Decoding (ICML 2024), Speculative Decoding Survey (ACL 2024)

## Installation

```bash
# Standard speculative decoding (transformers)
pip install transformers accelerate

# Medusa (multiple decoding heads)
git clone https://github.com/FasterDecoding/Medusa
cd Medusa
pip install -e .
cd ..

# Lookahead Decoding
git clone https://github.com/hao-ai-lab/LookaheadDecoding
cd LookaheadDecoding
pip install -e .
cd ..

# Optional: vLLM with speculative decoding
pip install vllm
```

## Quick Start

### Basic Speculative Decoding (Draft Model)

```python
import torch  # needed for torch.float16 below
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load target model (large, slow)
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    device_map="auto",
    torch_dtype=torch.float16
)

# Load draft model (small, fast)
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    t...
```
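The draft-then-verify loop behind draft model speculative decoding can be illustrated with a small, dependency-free sketch. This is a toy under stated assumptions, not the skill's or any library's implementation: `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are plain functions that map a token prefix to one next token, and only the greedy case is covered (sampling-based variants accept or reject drafted tokens by comparing probabilities instead of checking equality). A real system would also score all drafted positions in a single batched forward pass of the target model, which is where the speedup comes from; the sketch calls the target once per position for clarity.

```python
from typing import Callable, List

# Hypothetical stand-in for a language model: maps a token prefix to the
# greedy next token. Real models return logits over a vocabulary.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding: draft proposes k tokens, target verifies."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if expected != tok:
                # 3. First mismatch: keep the agreed prefix plus the
                #    target's own token, and discard the rest of the draft.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every drafted token was accepted; the target's verification
            # also yields one extra token if there is room for it.
            tokens.extend(proposal)
            if len(tokens) < limit:
                tokens.append(target(tokens))
    return tokens


def toy_target(prefix: List[int]) -> int:
    """Toy 'large' model: a deterministic function of the last token."""
    return (prefix[-1] * 3 + 1) % 11


def toy_draft(prefix: List[int]) -> int:
    """Toy 'small' model: agrees with the target except at some positions."""
    guess = toy_target(prefix)
    return (guess + 1) % 11 if len(prefix) % 5 == 0 else guess


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [2], max_new_tokens=12))
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target alone, which is the sense in which the speedup comes "without quality loss". In Hugging Face `transformers`, the same idea is available by passing a draft model as the `assistant_model` argument of `generate`.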

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Integrates with

Hugging Face · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

speculative-decoding

27,705 Updated today

davila7

AI & Automation Listed

vllm-speculative-decoding

Pick, configure, tune, monitor vLLM speculative decoding in production. Eleven SpeculativeMethod options (ngram, ngram_gpu, medusa, mlp_speculator, draft_model, suffix, eagle, eagle3, dflash, mtp, extract_hidden_states), `--speculative-config` JSON schema, which methods pair with which target model family, Prometheus acceptance metric surface, version gates (v0.11.1 EAGLE-3 preamble fix, v0.16 parallel drafting, v0.18 ngram_gpu, v0.19 dflash and zero-bubble), composability with chunked prefill / PP / LoRA / FP8 / structured outputs, Arctic Inference plugin, where spec-dec stops paying at high batch.

3 Updated yesterday

air-gapped

Code & Development Listed

vocabtrim-speculative-decoding

Accelerate speculative decoding by pruning drafter vocabulary to high-frequency tokens. Achieves 16% speedup in memory-bound settings by eliminating unused vocabulary entries without retraining.

3 Updated 2 months ago

ADu2021

AI & Automation Featured

model-pruning

Reduce LLM size and accelerate inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.

27,705 Updated today

davila7

AI & Automation Solid

model-pruning

9,182 Updated 1 month ago

Orchestra-Research