Luce-Org/lucebox-hub

A 27B model at 205 tok/s on a single GPU, no cloud required

Lucebox is a C++ inference server that squeezes speculative decoding and custom CUDA kernels into consumer VRAM.

★2.3k stars C++ Inference · Serving Language Models

View on GitHub ↗ Homepage ↗

Velocity · 7d

+36

★ / day

Trend

→steady

star history

What it does

Lucebox runs large language models locally via an OpenAI-compatible HTTP server, targeting single-GPU setups like an RTX 3090 or 5090. It bundles three optimization tracks: Megakernel (fused CUDA kernels for small models), DFlash (speculative decode with tree verification), and PFlash (speculative prefill that compresses prompts). The project also ships a harness to plug the server into Claude Code, Codex, Open WebUI, and other clients.

The interesting bit

The speedups come from pairing a full-size “target” model with a smaller “drafter” model—sometimes the same architecture at a different scale, sometimes entirely different families like Qwen-0.6B drafting Qwen-27B. The DDTree verifier then checks multiple draft tokens in parallel rather than one-by-one. For prefill, PFlash uses a tiny drafter to predict which prompt tokens matter and skips the rest via sparse attention. The README claims up to ~5.6× combined speedup on Qwen 3.6-27B with PFlash, and 4.84× with DDTree speculative decode.

Key highlights

Concrete benchmarks on real hardware: RTX 3090 is the reference target; RTX 5090 hits 205 tok/s; even a 2080 Ti manages 53 tok/s. AMD Strix Halo and RX 7900 XTX are supported via HIP.
Aggressive KV cache quantization: TQ3_0 at 3.5 bits-per-value fits 256K context in 24 GB VRAM; Q4_0 legacy path for ~128K.
Draft residency controls: --draft-residency lets you evict draft weights after each request to save VRAM, or keep them pinned for speed.
Multi-GPU and mixed backend: target and draft can live on different GPUs, or even CUDA/HIP split via IPC.
No PyTorch for the server: pure CMake/C++ build against a forked llama.cpp; only the Megakernel component needs PyTorch 2.0+.

Caveats

Narrow model support: the headline numbers are Qwen-heavy; Gemma-4-26B only sees 1.31×, and the drafter table is short.
Tuning burden: DDTree budget defaults vary by card (22 for 3090, 40 for 5090), and the README suggests re-sweeping for your specific GPU.
WSL2 and newer archs untested or community-benched: RTX 4090 is 🟡 WSL2-only; Jetson AGX Thor builds but has no benchmarks.

Verdict

Worth a look if you’re running local LLMs on a single high-end NVIDIA (or AMD) GPU and want more throughput than stock llama.cpp without managing a cloud bill. Skip it if you need broad model coverage out of the box, or if your hardware isn’t in the tested matrix.