One model, one engine: antirez bets the house on DeepSeek V4
A deliberately narrow inference engine that treats your SSD as first-class KV cache real estate.
What it does
DwarfStar runs DeepSeek V4 Flash (and PRO, if you have 512GB) locally on Metal and CUDA. It is not a generic GGUF loader: it ships its own quantization recipes, prompt rendering, tool calling, HTTP server, and even a coding agent. The author calls it “beta quality” and means it.
The interesting bit
The project treats KV cache as a “first-class disk citizen,” exploiting DeepSeek’s compressed cache and fast Mac SSDs to persist state across sessions. The 2-bit quantization is genuinely asymmetric: only routed MoE experts get squeezed, while shared experts and projections stay pristine. The README openly admits the code was built with “strong assistance from GPT 5.5”, a disclosure that doubles as a warning.
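To make the disk-persistence idea concrete, here is a minimal Python sketch of a KV store keyed by token prefix. It is not DwarfStar's code or file format: the class name `DiskKVCache`, the `.kv` suffix, and the hash-per-prefix lookup are all assumptions made for illustration, and the "KV state" is just an opaque byte string.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


class DiskKVCache:
    """Toy disk-backed store mapping a token prefix to an opaque KV blob.

    Hypothetical illustration only; the real engine's layout is not shown here.
    """

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, tokens: list[int]) -> Path:
        # Hash the token ids so the file name depends only on the prefix.
        raw = b"".join(t.to_bytes(4, "little") for t in tokens)
        return self.root / f"{hashlib.sha256(raw).hexdigest()}.kv"

    def save(self, tokens: list[int], blob: bytes) -> None:
        self._path(tokens).write_bytes(blob)

    def longest_prefix(self, tokens: list[int]) -> tuple[int, bytes | None]:
        """Return (n, blob) for the longest saved prefix tokens[:n], else (0, None)."""
        for n in range(len(tokens), 0, -1):
            path = self._path(tokens[:n])
            if path.exists():
                return n, path.read_bytes()
        return 0, None


cache = DiskKVCache(Path(tempfile.mkdtemp()))
cache.save([1, 2, 3], b"kv-state-for-3-tokens")

# A later session with a longer prompt can skip the first 3 tokens.
reused, blob = cache.longest_prefix([1, 2, 3, 4, 5])
print(reused, blob)
```

The point of the sketch is the payoff: once a prefix is on disk, a new session only has to process the tokens after it. The quadratic prefix scan is fine for a toy and would need a smarter index at long context lengths.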
Key highlights
- Targets 96–128GB MacBooks for Flash; 512GB for PRO
- 1M token context window with on-disk KV cache persistence
- Custom GGUFs with imatrix-tuned 2-bit quants; won’t run arbitrary GGUFs
- CPU path exists only for diagnostics; macOS CPU builds currently kernel-panic the OS
- Includes ds4-agent (alpha), speed benchmarks, and official-logit regression tests
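The asymmetric quantization described earlier (only routed MoE experts go to 2 bits, everything else is left alone) amounts to a per-tensor policy. A rough Python sketch of such a policy follows; the tensor names and the `routed_expert` marker are invented for illustration and do not reflect the project's actual GGUF naming or recipes.

```python
def pick_quant(tensor_name: str) -> str:
    """Return a quantization label for a tensor, chosen by (hypothetical) name."""
    # Assumption: routed-expert weights carry "routed_expert" in their name.
    if "routed_expert" in tensor_name:
        return "2-bit"
    # Shared experts, projections, and everything else are left untouched.
    return "keep"


# Hypothetical tensor names, for illustration only.
names = [
    "layer0.routed_expert.7.up",
    "layer0.shared_expert.up",
    "layer0.attn.q_proj",
]
plan = {name: pick_quant(name) for name in names}
for name, quant in plan.items():
    print(f"{name:28s} -> {quant}")
```

The bet behind such a split is that routed experts hold most of the weights while each token only touches a few of them, so squeezing them saves the most memory for the least quality loss.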
Caveats
- Beta quality, days-old in places; ds4-agent is alpha
- macOS CPU inference crashes the kernel due to an Apple VM bug the author could not work around
- PRO support is experimental; PRO GGUF generation still relies on external llama.cpp tooling
- MTP speculative decoding is correctness-gated and currently offers “at most a slight speedup”
Verdict
Worth a look if you own a loaded Mac Studio or DGX Spark and want a polished, opinionated DeepSeek V4 experience rather than wrangling generic loaders. Skip if you need broad model support, run Linux CPU-only, or flinch at AI-assisted C code.