← all repositories

FasterDecoding/Medusa

Framework that adds extra decoding heads to LLMs to predict multiple future tokens simultaneously and accelerate generation.

2.7k stars Jupyter Notebook Inference · ServingLanguage Models
Medusa
Velocity · 7d
+2.7
★ / day
Trend
steady
star history

Medusa accelerates LLM inference by appending additional decoding heads that predict multiple future tokens in parallel. Unlike speculative decoding, it does not require a separate draft model. The original model remains unchanged while only the new heads are fine-tuned. During generation, tree-based attention combines candidates from all heads, and an acceptance scheme selects the longest plausible prefix for continued decoding.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.