FasterDecoding/Medusa
Framework that adds extra decoding heads to LLMs to predict multiple future tokens simultaneously and accelerate generation.

Velocity · 7d
+2.7
★ / day
Trend
→steady
star history
Medusa accelerates LLM inference by appending additional decoding heads that predict multiple future tokens in parallel. Unlike speculative decoding, it does not require a separate draft model. The original model remains unchanged while only the new heads are fine-tuned. During generation, tree-based attention combines candidates from all heads, and an acceptance scheme selects the longest plausible prefix for continued decoding.