← all repositories
Luce-Org/lucebox-hub

A 27B model at 205 tok/s on a single GPU, no cloud required

Lucebox is a C++ inference server that squeezes speculative decoding and custom CUDA kernels into consumer VRAM.

lucebox-hub
Velocity · 7d
+36
★ / day
Trend
steady
star history

What it does

Lucebox runs large language models locally via an OpenAI-compatible HTTP server, targeting single-GPU setups like an RTX 3090 or 5090. It bundles three optimization tracks: Megakernel (fused CUDA kernels for small models), DFlash (speculative decode with tree verification), and PFlash (speculative prefill that compresses prompts). The project also ships a harness to plug the server into Claude Code, Codex, Open WebUI, and other clients.

The interesting bit

The speedups come from pairing a full-size “target” model with a smaller “drafter” model—sometimes the same architecture at a different scale, sometimes entirely different families like Qwen-0.6B drafting Qwen-27B. The DDTree verifier then checks multiple draft tokens in parallel rather than one-by-one. For prefill, PFlash uses a tiny drafter to predict which prompt tokens matter and skips the rest via sparse attention. The README claims up to ~5.6× combined speedup on Qwen 3.6-27B with PFlash, and 4.84× with DDTree speculative decode.

Key highlights

  • Concrete benchmarks on real hardware: RTX 3090 is the reference target; RTX 5090 hits 205 tok/s; even a 2080 Ti manages 53 tok/s. AMD Strix Halo and RX 7900 XTX are supported via HIP.
  • Aggressive KV cache quantization: TQ3_0 at 3.5 bits-per-value fits 256K context in 24 GB VRAM; Q4_0 legacy path for ~128K.
  • Draft residency controls: --draft-residency lets you evict draft weights after each request to save VRAM, or keep them pinned for speed.
  • Multi-GPU and mixed backend: target and draft can live on different GPUs, or even CUDA/HIP split via IPC.
  • No PyTorch for the server: pure CMake/C++ build against a forked llama.cpp; only the Megakernel component needs PyTorch 2.0+.

Caveats

  • Narrow model support: the headline numbers are Qwen-heavy; Gemma-4-26B only sees 1.31×, and the drafter table is short.
  • Tuning burden: DDTree budget defaults vary by card (22 for 3090, 40 for 5090), and the README suggests re-sweeping for your specific GPU.
  • WSL2 and newer archs untested or community-benched: RTX 4090 is 🟡 WSL2-only; Jetson AGX Thor builds but has no benchmarks.

Verdict

Worth a look if you’re running local LLMs on a single high-end NVIDIA (or AMD) GPU and want more throughput than stock llama.cpp without managing a cloud bill. Skip it if you need broad model coverage out of the box, or if your hardware isn’t in the tested matrix.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.