vocabtrim-speculative-decoding

Accelerate speculative decoding by pruning the drafter vocabulary to high-frequency tokens. Achieves a 16% speedup in memory-bound settings by eliminating unused vocabulary entries without retraining.
ADu2021/skillXiv · ★ 3 · Code & Development · score 67

Install: claude install-skill ADu2021/skillXiv

# VocabTrim: Memory-Efficient Vocabulary Pruning for Speculative Decoding

Speculative decoding uses a small drafter model to propose multiple tokens per inference step, which the target verifier model then accepts or rejects. This approach can speed up inference by 2-3× when the drafter is fast. However, the drafter's language modeling head (the final layer that outputs logits over all vocabulary tokens) becomes a memory bottleneck. For Llama-3, with its 128K-token vocabulary, computing logits over all 128K tokens at every step wastes memory and computation, even though the drafter only samples from a small subset of frequently occurring tokens.

VocabTrim solves this by reconstructing the drafter's vocabulary to contain only the high-frequency tokens it actually samples during inference. The insight is that drafters are biased toward "easy-to-predict" tokens (common words, punctuation) and rarely sample rare tokens. By trimming the vocabulary to the most frequent 25-50K tokens, you eliminate 60-75% of the LM head computation with negligible impact on acceptance rates.

## Core Concept

VocabTrim works on a simple principle: **replace the drafter's full vocabulary with a smaller one containing only tokens it frequently generates**. The key insights are:

1. Drafters naturally focus on high-frequency, predictable tokens because they're trying to be fast and accurate
2. Target models have access to the full vocabulary during verification, so drafter coverage doesn't need to be complete
3. Toke