deepseek-ai/DeepSeek-V3

671B parameters, 37B active: the efficiency trick behind DeepSeek-V3

A massive Mixture-of-Experts model that trains cheap and runs lean by keeping most of its weights asleep.

Velocity (7d): +196 ★/day · Trend: steady

What it does

DeepSeek-V3 is a 671-billion-parameter language model that activates only 37 billion parameters per token. It uses a Mixture-of-Experts (MoE) architecture with Multi-head Latent Attention, supports a 128K-token context window, and comes in both base and chat-tuned versions. The weights are openly available on Hugging Face.
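To put the headline numbers in proportion, here is a quick back-of-the-envelope calculation using only the two parameter counts quoted above. The function name is made up for illustration.

```python
# Parameter counts quoted in the summary above.
TOTAL_PARAMS = 671e9    # all parameters in the model
ACTIVE_PARAMS = 37e9    # parameters activated for each token


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used to process a single token."""
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"Active per token: {fraction:.1%}")
print(f"Idle per token:   {1 - fraction:.1%}")
```

Roughly one parameter in eighteen does work on any given token, which is where the "weights asleep" framing comes from.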

The interesting bit

The real engineering story is the training cost: 2.664M H800 GPU hours for pre-training on 14.8 trillion tokens, plus only about 0.1M GPU hours for every later training stage. They pulled this off with FP8 mixed-precision training, validated at this scale for the first time, and by nearly eliminating the communication bottleneck in cross-node MoE training. They also distilled reasoning patterns from their own DeepSeek-R1 chain-of-thought model while keeping control over output style and length.
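The GPU-hour figures are easier to compare once they are normalized. The sketch below uses only the numbers quoted above, with one exception: the rental price per GPU hour is a hypothetical placeholder, not something this summary states, so treat the dollar figure as illustrative.

```python
# Figures quoted in the paragraph above.
PRETRAIN_GPU_HOURS = 2.664e6   # H800 GPU hours for pre-training
POST_GPU_HOURS = 0.1e6         # GPU hours for everything after pre-training
PRETRAIN_TOKENS = 14.8e12      # tokens seen during pre-training

# Hypothetical rental price, NOT from the summary; swap in your own rate.
ASSUMED_USD_PER_GPU_HOUR = 2.0

total_gpu_hours = PRETRAIN_GPU_HOURS + POST_GPU_HOURS
gpu_hours_per_trillion_tokens = PRETRAIN_GPU_HOURS / (PRETRAIN_TOKENS / 1e12)
tokens_per_gpu_hour = PRETRAIN_TOKENS / PRETRAIN_GPU_HOURS
estimated_cost_usd = total_gpu_hours * ASSUMED_USD_PER_GPU_HOUR

print(f"Total GPU hours:               {total_gpu_hours / 1e6:.3f}M")
print(f"GPU hours per trillion tokens: {gpu_hours_per_trillion_tokens / 1e3:.0f}K")
print(f"Tokens per GPU hour:           {tokens_per_gpu_hour / 1e6:.2f}M")
print(f"Cost at assumed rate:          ${estimated_cost_usd / 1e6:.2f}M")
```

Pre-training works out to about 180K GPU hours per trillion tokens, and the later stages add less than 4% on top.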

Key highlights

  • Auxiliary-loss-free load balancing: keeps experts evenly utilized without the usual performance penalty
  • Multi-Token Prediction (MTP) module (14B extra parameters) that can also be used for speculative decoding to speed up inference
  • Benchmarks place it ahead of open-source rivals and competitive with closed models on math, code, and long-context tasks
  • Training was reportedly stable: “no irrecoverable loss spikes or rollbacks”
  • MIT license for code, separate model agreement for weights
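The summary does not say how the auxiliary-loss-free balancing in the first highlight works. The toy simulation below is one reading of the idea, under stated assumptions: each expert carries a bias that is added to its routing score only when picking the top-k experts, and after every step the bias is nudged down for overloaded experts and up for underloaded ones. The expert count, step size `gamma`, and the skewed random "router" are all invented for illustration; none of this is DeepSeek's actual code.

```python
import random


def route(scores, bias, k):
    """Pick the top-k experts by biased score (the bias affects selection only)."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:k]


def update_bias(bias, load, gamma):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


def simulate(steps=100, tokens=512, n_experts=8, k=2, gamma=0.01, seed=0):
    """Return per-expert token counts for the first and the last step."""
    rng = random.Random(seed)
    # Toy router (an assumption): higher-numbered experts are systematically preferred.
    preference = [0.5 * e / n_experts for e in range(n_experts)]
    bias = [0.0] * n_experts
    history = []
    for _ in range(steps):
        load = [0] * n_experts
        for _ in range(tokens):
            scores = [preference[e] + rng.random() for e in range(n_experts)]
            for e in route(scores, bias, k):
                load[e] += 1
        history.append(load)
        bias = update_bias(bias, load, gamma)
    return history[0], history[-1]


def spread(load):
    """Gap between the busiest and the idlest expert."""
    return max(load) - min(load)


first, last = simulate()
print("first step load:", first, "spread:", spread(first))
print("last step load: ", last, "spread:", spread(last))
```

In this toy setup the gap between the busiest and idlest expert shrinks as the biases settle. No extra loss term is involved, which is the point of the "auxiliary-loss-free" label.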

Caveats

  • MTP support is “currently under active development within the community,” so it is not fully baked yet
  • Running this locally requires serious hardware; the repo points to community partnerships and vendor-specific guides rather than one universal setup
  • The full download is 685B parameters (the 671B main model plus the 14B Multi-Token Prediction module), which is not trivial to store or serve
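For a rough sense of what "serious hardware" means, the sketch below estimates the memory needed just to hold the weights, assuming the 685B figure counts parameters (671B main model plus the 14B MTP module mentioned above). The 80 GB per-accelerator capacity is a hypothetical value chosen for illustration, and the estimate ignores activations, KV cache, and framework overhead, so real requirements are higher.

```python
import math

TOTAL_PARAMS = 685e9  # assumed to be a parameter count: 671B main + 14B MTP

# Bytes per parameter for two common storage precisions.
BYTES_PER_PARAM = {"FP8": 1, "BF16": 2}

# Hypothetical accelerator memory, NOT from the summary.
ASSUMED_GPU_MEMORY_GB = 80


def weight_footprint_gb(params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def min_gpus_for_weights(footprint_gb: float, gpu_memory_gb: float) -> int:
    """Lower bound on device count; ignores activations and KV cache."""
    return math.ceil(footprint_gb / gpu_memory_gb)


estimates = {}
for name, nbytes in BYTES_PER_PARAM.items():
    gb = weight_footprint_gb(TOTAL_PARAMS, nbytes)
    estimates[name] = (gb, min_gpus_for_weights(gb, ASSUMED_GPU_MEMORY_GB))
    print(f"{name}: {gb:.0f} GB of weights, at least {estimates[name][1]} devices")
```

Even the one-byte-per-parameter case lands well beyond a single machine's worth of consumer hardware, which fits the repo's habit of pointing to vendor-specific deployment guides.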

Verdict

Worth studying if you’re training large MoEs or optimizing distributed systems. If you just need a capable API, you can use DeepSeek’s hosted version and skip the infrastructure headache.
