
MoonshotAI/Moonlight

A Mixture-of-Experts language model with 3B activated and 16B total parameters, trained on 5.7T tokens using the Muon optimizer, achieving roughly 2× computational efficiency compared to AdamW.

Star velocity (7d): +3.2 ★/day · Trend: steady

The project introduces Moonlight, a foundation model that improves the Pareto frontier by achieving better performance with fewer training FLOPs than prior models. It identifies two techniques that are crucial for scaling the Muon optimizer to large models: adding weight decay and carefully adjusting the per-parameter update scale. The team open-sources a distributed Muon implementation that is memory-optimal and communication-efficient, along with pretrained, instruction-tuned, and intermediate model checkpoints.
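
The two scaling techniques named above can be illustrated with a small single-matrix sketch. This is not the project's distributed implementation. It is a dependency-free toy in plain Python that assumes the update rule, as I understand it from the accompanying report, is roughly W ← W − lr · (0.2 · sqrt(max(rows, cols)) · O + λ · W), where O is the momentum matrix after approximate orthogonalization by Newton-Schulz iteration. The function names, the default hyperparameters, and the Newton-Schulz coefficients (taken from the publicly circulated Muon reference code) are all assumptions made for illustration, and plain rather than Nesterov momentum is used for brevity.

```python
import math

# Quintic Newton-Schulz coefficients (assumed, from public Muon reference code).
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g: push its singular values toward 1."""
    a, b, c = NS_COEFFS
    tall = len(g) > len(g[0])
    # Work on the wide orientation so X X^T is the smaller Gram matrix.
    x = transpose(g) if tall else [row[:] for row in g]
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))                      # A = X X^T
        gram2 = matmul(gram, gram)                          # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]     # bA + cA^2
                for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        x = [[a * u + v for u, v in zip(r1, r2)]            # aX + (bA + cA^2) X
             for r1, r2 in zip(x, px)]
    return transpose(x) if tall else x


def update_scale(rows, cols, rms_scale=0.2):
    """Technique 2: per-parameter scale that depends on the matrix shape."""
    return rms_scale * math.sqrt(max(rows, cols))


def muon_step(weight, momentum, grad, lr=0.02, mu=0.95, weight_decay=0.1):
    """One hypothetical Muon step on a single 2-D weight matrix."""
    momentum = [[mu * m + g for m, g in zip(mr, gr)]
                for mr, gr in zip(momentum, grad)]
    ortho = newton_schulz(momentum)
    scale = update_scale(len(weight), len(weight[0]))
    # Technique 1: decoupled weight decay, added next to the scaled update.
    weight = [[w - lr * (scale * o + weight_decay * w)
               for w, o in zip(wr, orow)]
              for wr, orow in zip(weight, ortho)]
    return weight, momentum
```

Calling `muon_step(weight, momentum, grad)` once per training step returns the new weight and momentum matrices. The shape-dependent factor gives matrices of different sizes updates of comparable magnitude, and the decay term stops the weights from growing without bound over long runs. A real implementation would use a tensor library and shard this work across devices.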
