671B parameters, 37B active: the efficiency trick behind DeepSeek-V3
A massive Mixture-of-Experts model that trains cheap and runs lean by keeping most of its weights asleep.

What it does
DeepSeek-V3 is a 671-billion-parameter language model that activates only 37 billion parameters per token. It uses a Mixture-of-Experts (MoE) architecture with Multi-head Latent Attention, supports a 128K-token context window, and comes in both base and chat-tuned versions. The weights are openly available on Hugging Face.
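To make "only activates 37 billion parameters per token" concrete, here is a minimal sketch of top-k expert routing. Everything in it is illustrative: the expert count, the value of k, the scalar "experts", and the softmax gate are assumptions chosen for readability, not DeepSeek-V3's actual configuration or gating function.

```python
import math
import random

def top_k_route(scores, k):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the gate weights sum to 1.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_scores, k):
    """Run only the k selected experts for one token and mix their outputs."""
    routed = top_k_route(router_scores, k)
    output = sum(weight * experts[i](x) for i, weight in routed)
    return output, [i for i, _ in routed]

# Toy setup (hypothetical sizes): 8 experts, 2 active per token.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

output, used = moe_forward(1.0, experts, scores, TOP_K)
print(f"experts used: {sorted(used)} of {NUM_EXPERTS}")

# Figures from the text: 37B active out of 671B total.
print(f"active share per token: {37 / 671:.1%}")
```

The experts that are not selected are never evaluated, which is why compute per token tracks the active parameter count (roughly 5.5% of the total here) rather than the full model size.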
The interesting bit
The real engineering story is the training cost: 2.664M H800 GPU hours for pre-training on 14.8 trillion tokens, plus only about 0.1M GPU hours for the stages after pre-training. They pulled this off with FP8 mixed-precision training, validated at this scale for the first time, and by nearly eliminating the communication bottleneck in cross-node MoE training. They also distilled reasoning patterns from their own DeepSeek-R1 chain-of-thought model while keeping the output style and length under control.
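A quick back-of-the-envelope check on those numbers. The GPU-hour and token figures come from the text above; the rental price is a hypothetical placeholder, so treat the dollar figure as an illustration of the arithmetic rather than a reported cost.

```python
PRETRAIN_GPU_HOURS = 2.664e6  # from the text
LATER_GPU_HOURS = 0.1e6       # from the text (stages after pre-training)
PRETRAIN_TOKENS = 14.8e12     # from the text

def training_summary(price_per_gpu_hour):
    """Derive totals from the quoted figures; the price is an assumption."""
    total_hours = PRETRAIN_GPU_HOURS + LATER_GPU_HOURS
    return {
        "total_gpu_hours": total_hours,
        "tokens_per_gpu_hour": PRETRAIN_TOKENS / PRETRAIN_GPU_HOURS,
        "estimated_cost_usd": total_hours * price_per_gpu_hour,
    }

# Hypothetical rental rate of $2 per H800 GPU hour.
summary = training_summary(price_per_gpu_hour=2.0)
for name, value in summary.items():
    print(f"{name}: {value:,.0f}")
```

Under these assumptions, pre-training works out to roughly 5.6 million tokens per GPU hour, and the whole run lands in the single-digit millions of dollars.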
Key highlights
- Auxiliary-loss-free load balancing: keeps experts evenly utilized without the usual performance penalty
- Multi-Token Prediction module (14B extra weights) that can double as speculative decoding for faster inference
- Benchmarks place it ahead of open-source rivals and competitive with closed models on math, code, and long-context tasks
- Training was reportedly stable: “no irrecoverable loss spikes or rollbacks”
- MIT license for code, separate model agreement for weights
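The first highlight above, auxiliary-loss-free load balancing, is easier to picture with a toy simulation. The sketch below assumes one way such a scheme can work: each expert carries a bias that is added to its routing score for top-k selection only, and after every batch the bias is nudged down for overloaded experts and up for underloaded ones. The expert count, skew, noise level, and step size are all hypothetical.

```python
import random

def route_with_bias(affinities, bias, k):
    """Pick the top-k experts by affinity + bias (bias steers selection only)."""
    ranked = sorted(
        range(len(affinities)),
        key=lambda i: affinities[i] + bias[i],
        reverse=True,
    )
    return ranked[:k]

def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step_size)
        elif n < mean_load:
            new_bias.append(b + step_size)
        else:
            new_bias.append(b)
    return new_bias

def simulate(num_experts=8, k=2, tokens_per_batch=512, batches=200,
             step_size=0.01, seed=0):
    """Return per-expert token counts for the first and last batch."""
    rng = random.Random(seed)
    # Hypothetical skew: lower-indexed experts are systematically preferred.
    skew = [0.5 - 0.1 * i for i in range(num_experts)]
    bias = [0.0] * num_experts
    first_load = last_load = None
    for _ in range(batches):
        load = [0] * num_experts
        for _ in range(tokens_per_batch):
            affinities = [s + rng.gauss(0.0, 0.3) for s in skew]
            for expert in route_with_bias(affinities, bias, k):
                load[expert] += 1
        if first_load is None:
            first_load = load
        last_load = load
        bias = update_bias(bias, load, step_size)
    return first_load, last_load

first, last = simulate()
print("first batch load:", first)
print("last batch load: ", last)
```

No balancing term is added to the loss here; the only feedback is the bias update, which is the appeal of the approach.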
Caveats
- MTP support is “currently under active development within the community,” so it is not fully baked yet
- Running this locally requires serious hardware; the repo points to community partnerships and vendor-specific guides rather than one universal setup
- The download is not trivial: the released checkpoint totals 685B parameters (671B for the main model plus 14B for the MTP module)
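For a rough sense of what the download means in disk space, the sketch below converts the parameter counts quoted in this write-up (671B main model, 14B MTP module) into gigabytes. The bytes-per-parameter values are assumptions about storage precision, and the estimate ignores scaling factors, metadata, and tokenizer files.

```python
MAIN_MODEL_PARAMS = 671e9  # from the text
MTP_MODULE_PARAMS = 14e9   # from the text

def checkpoint_size_gb(num_params, bytes_per_param):
    """Approximate raw weight size in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

total_params = MAIN_MODEL_PARAMS + MTP_MODULE_PARAMS

# Assumed storage precisions: 1 byte per weight (FP8) and 2 bytes (BF16).
for label, bytes_per_param in [("FP8", 1), ("BF16", 2)]:
    size = checkpoint_size_gb(total_params, bytes_per_param)
    print(f"{label}: ~{size:,.0f} GB")
```

Even at one byte per weight this is several hundred gigabytes before any runtime memory overhead, which is why multi-GPU or multi-node serving is the practical default.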
Verdict
Worth studying if you’re training large MoEs or optimizing distributed systems. If you just need a capable API, you can use DeepSeek’s hosted version and skip the infrastructure headache.