pjlab-sys4nlp/llama-moe
A project that constructs sparse Mixture-of-Experts models by partitioning LLaMA's feed-forward networks into experts with top-K routing gates.

LLaMA-MoE builds sparse Mixture-of-Experts (MoE) models from LLaMA by partitioning each layer's feed-forward network (FFN) into experts and inserting a top-K routing gate that selects which experts process each token. The initialized MoE models are then continually pre-trained on data drawn from SlimPajama and filtered datasets, using optimized sampling weights. The resulting models activate only 3.0-3.5B parameters per token, fewer than dense LLaMA models, while maintaining language modeling capabilities.
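
The partition-and-route idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the function names (`swiglu_ffn`, `partition_neurons`, `top_k_route`, `moe_ffn`), the random equal-sized split, the softmax-weighted combination of expert outputs, and the toy dimensions are all hypothetical choices made for this example. It assumes a LLaMA-style SwiGLU FFN and treats each expert as a disjoint subset of the FFN's intermediate neurons.

```python
import math
import random


def silu(v):
    """SiLU activation, as used inside a SwiGLU feed-forward block."""
    return v / (1.0 + math.exp(-v))


def swiglu_ffn(x, w_gate, w_up, w_down, neurons):
    """SwiGLU FFN restricted to the given intermediate neuron indices.

    Shapes: w_gate and w_up are [d_ff][d_model]; w_down is [d_model][d_ff].
    Passing range(d_ff) as `neurons` gives the dense FFN output.
    """
    d_model = len(x)
    out = [0.0] * d_model
    for n in neurons:
        g = sum(w_gate[n][i] * x[i] for i in range(d_model))
        u = sum(w_up[n][i] * x[i] for i in range(d_model))
        h = silu(g) * u
        for j in range(d_model):
            out[j] += w_down[j][n] * h
    return out


def partition_neurons(d_ff, num_experts, seed=0):
    """Split neuron indices into equal-sized disjoint groups (assumed: random split)."""
    if d_ff % num_experts != 0:
        raise ValueError("d_ff must be divisible by num_experts")
    idx = list(range(d_ff))
    random.Random(seed).shuffle(idx)
    size = d_ff // num_experts
    return [sorted(idx[e * size:(e + 1) * size]) for e in range(num_experts)]


def top_k_route(x, w_router, k):
    """Return (expert_index, softmax_score) pairs for the k highest-scoring experts."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_router]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    return [(e, scores[e]) for e in top]


def moe_ffn(x, w_gate, w_up, w_down, w_router, groups, k):
    """Sparse forward pass: only the k routed experts are evaluated."""
    out = [0.0] * len(x)
    for e, score in top_k_route(x, w_router, k):
        y = swiglu_ffn(x, w_gate, w_up, w_down, groups[e])
        for j in range(len(x)):
            out[j] += score * y[j]
    return out


# Toy sizes for illustration only.
d_model, d_ff, num_experts, k = 4, 8, 4, 2
rng = random.Random(42)
x = [rng.uniform(-1, 1) for _ in range(d_model)]
w_gate = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_up = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_down = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
w_router = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(num_experts)]

groups = partition_neurons(d_ff, num_experts)
sparse_out = moe_ffn(x, w_gate, w_up, w_down, w_router, groups, k)
```

Because the experts are disjoint slices of the original FFN, summing the unweighted outputs of all experts reproduces the dense FFN output exactly (up to floating-point error). With top-K routing only K of the experts run per token, which is where the reduction in activated parameters comes from; the newly added router is untrained at initialization, which is one reason continued pre-training is needed.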