MoonshotAI/Moonlight
A Mixture-of-Experts (MoE) language model with 3B activated and 16B total parameters, trained on 5.7T tokens with the Muon optimizer, which is roughly 2× as computationally efficient as AdamW.

The project introduces Moonlight, a foundation model that advances the Pareto frontier of performance versus training compute: it performs better than prior models while using fewer training FLOPs. It identifies two techniques as crucial for scaling the Muon optimizer to large models: adding weight decay and carefully adjusting the per-parameter update scale. The team open-sources a distributed Muon implementation that is memory-optimal and communication-efficient, along with pretrained, instruction-tuned, and intermediate model checkpoints.
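
The two scaling techniques can be illustrated with a minimal NumPy sketch of a single Muon-style step for one weight matrix. This is not the project's distributed implementation. Several details are assumptions that do not appear in this summary: the update is orthogonalized with an exact SVD (practical Muon implementations approximate this with a few Newton-Schulz iterations), the update is scaled by `0.2 * sqrt(max(A, B))` for an `A × B` matrix so that its RMS is about 0.2, and the function name `muon_step` and all hyperparameter defaults are hypothetical.

```python
import numpy as np


def orthogonalize(m):
    """Return U @ V^T from the reduced SVD of m.

    Assumption: an exact SVD stands in for the Newton-Schulz
    approximation that practical Muon implementations use.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def muon_step(weight, grad, momentum_buf,
              lr=0.02, mu=0.95, weight_decay=0.1, rms_target=0.2):
    """One Muon-style update for a 2-D weight matrix (illustrative sketch).

    All default hyperparameter values are hypothetical.
    """
    # Accumulate momentum over gradients.
    momentum_buf = mu * momentum_buf + grad
    # Orthogonalize the momentum so every singular value becomes 1.
    ortho = orthogonalize(momentum_buf)
    # Technique 2: per-parameter update scale. A full-rank orthogonalized
    # A x B matrix has RMS 1 / sqrt(max(A, B)), so this factor brings the
    # update RMS to rms_target regardless of the matrix shape.
    scale = rms_target * np.sqrt(max(weight.shape))
    # Technique 1: weight decay applied directly to the weights.
    new_weight = weight - lr * (scale * ortho + weight_decay * weight)
    return new_weight, momentum_buf


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 256))
g = rng.normal(size=(64, 256))
buf = np.zeros_like(w)

w_new, buf = muon_step(w, g, buf)

# Recompute the scaled update on its own to inspect its RMS.
update = orthogonalize(buf) * 0.2 * np.sqrt(max(w.shape))
update_rms = float(np.sqrt(np.mean(update ** 2)))
```

Under these assumptions the scaled update has the same RMS for matrices of any shape, which is the point of adjusting the update scale per parameter: one learning rate and one weight-decay coefficient can then be shared across differently shaped matrices.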