Is LookaheadDecoding open source?

Yes — hao-ai-lab/LookaheadDecoding is open source, released under the Apache-2.0 license.

What language is LookaheadDecoding written in?

hao-ai-lab/LookaheadDecoding is primarily written in Python.

How popular is LookaheadDecoding?

hao-ai-lab/LookaheadDecoding has 1.3k stars on GitHub.

Where can I find LookaheadDecoding?

hao-ai-lab/LookaheadDecoding is on GitHub at https://github.com/hao-ai-lab/LookaheadDecoding.


hao-ai-lab/LookaheadDecoding

Parallel Decoding for LLMs, No Draft Model Required

It accelerates LLaMA inference by guessing future tokens in parallel via Jacobi iteration—no draft model or data store needed.

★1.3k stars Python Inference · Serving Language Models




What it does

Lookahead Decoding is a parallel decoding algorithm that accelerates LLaMA inference by breaking the sequential, token-by-token dependency of standard autoregressive decoding. It treats decoding as solving a nonlinear system via Jacobi iteration, collects n-grams from the iteration trajectory, and verifies promising candidates in a single forward pass using a fused attention mask. The authors report a 1.5× to 2.3× reduction in single-GPU latency across several datasets, with no draft model or external data store required.
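To make the "decoding as Jacobi iteration" idea concrete, here is a minimal sketch on a toy problem. It is not the project's code: toy_next_token is a hypothetical stand-in for one greedy LLM step, and the per-position loop stands in for what a real model would compute for all positions in one forward pass. The sketch only shows that updating every position in parallel from the previous guess converges to the same output as ordinary autoregressive decoding.

```python
from typing import List, Sequence, Tuple


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for a greedy LLM step (deterministic in the prefix)."""
    return (31 * sum(prefix) + 7 * len(prefix)) % 50


def autoregressive_decode(prompt: Sequence[int], m: int) -> List[int]:
    """Baseline: generate m tokens one at a time."""
    out = list(prompt)
    for _ in range(m):
        out.append(toy_next_token(out))
    return out[len(prompt):]


def jacobi_decode(prompt: Sequence[int], m: int) -> Tuple[List[int], int]:
    """Solve y_i = f(prompt, y_<i) for all i at once by fixed-point iteration.

    Every position is refreshed from the previous guess, so the m updates
    in one iteration are independent of each other.
    """
    guess = [0] * m  # arbitrary initial guess
    iterations = 0
    while True:
        iterations += 1
        new_guess = [toy_next_token(list(prompt) + guess[:i]) for i in range(m)]
        if new_guess == guess:  # fixed point reached
            return guess, iterations
        guess = new_guess


tokens, iterations = jacobi_decode([1, 2, 3], 8)
print(tokens == autoregressive_decode([1, 2, 3], 8), iterations)
```

Position i can only be trusted once positions 0 to i-1 are correct, so plain Jacobi iteration may need up to m iterations (plus one to detect the fixed point). That is why it saves little on its own, and why the project adds n-gram caching on top.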

The interesting bit

Jacobi decoding has long been able to predict tokens in parallel, yet it “can barely see wall-clock speedup in real-world LLM applications.” The project makes it practical by caching n-grams from the iteration history and splitting each step into two parallel branches, both packed into a single GPU attention mask: a lookahead branch that speculates ahead through a 2D window, and a verification branch that checks cached n-grams for matches.
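The n-gram cache and the verification step can be sketched as follows. This is an illustrative toy under stated assumptions, not the repository's implementation: toy_next_token, collect_ngrams, verify and decode_with_pool are hypothetical names, the pool is seeded by hand rather than filled by a lookahead branch, and candidates are checked in a Python loop where the real system checks them in one forward pass through its attention mask.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Pool = Dict[int, Set[Tuple[int, ...]]]


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for a greedy LLM step (deterministic in the prefix)."""
    return (31 * sum(prefix) + 7 * len(prefix)) % 50


def collect_ngrams(trajectory: Sequence[int], n: int, pool: Pool) -> None:
    """Cache every n-gram of the trajectory, keyed by its first token."""
    for i in range(len(trajectory) - n + 1):
        gram = tuple(trajectory[i:i + n])
        pool[gram[0]].add(gram[1:])


def verify(prefix: Sequence[int], pool: Pool) -> List[int]:
    """Accept the longest cached continuation that matches greedy decoding,
    then append the model's own next token, so each step emits >= 1 token."""
    best: List[int] = []
    for candidate in pool.get(prefix[-1], ()):
        ctx, accepted = list(prefix), []
        for tok in candidate:
            if toy_next_token(ctx) != tok:  # mismatch: reject the rest
                break
            accepted.append(tok)
            ctx.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best + [toy_next_token(list(prefix) + best)]


def decode_with_pool(prompt: Sequence[int], m: int, pool: Pool) -> Tuple[List[int], int]:
    """Generate m tokens, counting how many sequential steps were needed."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < m:
        out.extend(verify(out, pool))
        steps += 1
    return out[len(prompt):len(prompt) + m], steps
```

Because every accepted token is checked against the model's own greedy choice, the output is identical to plain greedy decoding whatever the pool contains. A good pool only reduces the number of sequential steps; a bad one costs wasted verification work.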

Key highlights

Eliminates the need for a draft model or external data store, unlike speculative decoding.
Supports both greedy search and sampling, and offers optional FlashAttention integration via a specialized kernel build.
Drops into existing HuggingFace generate() calls with only a few lines of Python.
The number of decoding steps falls linearly with log(FLOPs) spent per step, so each further reduction in steps costs multiplicatively more compute.
Currently limited to LLaMA architectures; adapting it to other models requires manual implementation.

Caveats

Only LLaMA is supported at present; the README notes that each model architecture requires its own adaptation.
FlashAttention support currently relies on a separate forked repository and pre-built wheels rather than a unified install.
The speedup comes from trading extra FLOPs per step for fewer sequential steps, so the gains depend on having spare GPU compute to absorb the larger forward pass.

Verdict

A solid experiment if you are serving LLaMA on a single GPU and want lower latency without maintaining a draft model. Otherwise, wait for broader architectural support.

Frequently asked

What is hao-ai-lab/LookaheadDecoding?: It accelerates LLaMA inference by guessing future tokens in parallel via Jacobi iteration—no draft model or data store needed.