Is mixture-of-experts open source?

Yes — lucidrains/mixture-of-experts is open source, released under the MIT license.

What language is mixture-of-experts written in?

lucidrains/mixture-of-experts is primarily written in Python.

How popular is mixture-of-experts?

lucidrains/mixture-of-experts has 865 stars on GitHub.

Where can I find mixture-of-experts?

lucidrains/mixture-of-experts is on GitHub at https://github.com/lucidrains/mixture-of-experts.

lucidrains/mixture-of-experts

More weights, same workload: sparse gating in PyTorch

A PyTorch transcription of sparsely-gated Mixture of Experts for ballooning language model capacity without increasing per-token computation.

★ 865 stars · Python · ML Frameworks · Language Models

What it does

This library implements the Sparsely-Gated Mixture of Experts layer in PyTorch, letting you replace a dense feed-forward block with a routing gate and a bank of expert networks. Total parameter count grows with the number of experts (into the billions, if you stack hierarchical gates), but the gate sends each token to at most two experts, so per-token compute stays roughly flat. The code is largely a line-by-line translation of the original TensorFlow implementation, with a handful of enhancements and support for custom expert architectures.
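To make the "more parameters, flat compute" trade-off concrete, here is a back-of-the-envelope sketch in plain Python. The sizes (model width 512, hidden width 2048, 16 experts, two active experts per token) are illustrative assumptions rather than values taken from the library, and the helper names are hypothetical. Biases and the gate's own weights are ignored.

```python
def expert_params(dim, hidden_dim):
    """Weights in one feed-forward expert: dim -> hidden_dim -> dim (biases ignored)."""
    return dim * hidden_dim + hidden_dim * dim


def moe_param_summary(dim, hidden_dim, num_experts, active_per_token):
    """Compare parameters stored in the layer with parameters used per token."""
    per_expert = expert_params(dim, hidden_dim)
    return {
        "per_expert": per_expert,
        "total": per_expert * num_experts,
        "active_per_token": per_expert * active_per_token,
    }


# Assumed sizes, chosen only for illustration.
summary = moe_param_summary(dim=512, hidden_dim=2048, num_experts=16, active_per_token=2)
print(summary["total"])             # 33554432 weights stored
print(summary["active_per_token"])  # 4194304 weights touched per token

# Doubling the expert count doubles storage but not per-token work.
bigger = moe_param_summary(dim=512, hidden_dim=2048, num_experts=32, active_per_token=2)
print(bigger["total"], bigger["active_per_token"])
```

Under these assumptions, going from 16 to 32 experts doubles the stored weights while the number of weights each token actually passes through stays the same.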

The interesting bit

The hierarchical mode stacks two levels of sparse routers, which is how the GShard paper scaled to giant model sizes without a single gate choking on an enormous expert pool. You can also inject your own expert network, say a deeper MLP, if the default expert (a feed-forward block with a single hidden layer) feels too pedestrian.
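The idea behind two-level routing can be sketched without any framework. The example below is a simplified, hypothetical illustration: it uses top-1 selection at each level and hand-written logits, whereas the library's hierarchical layer may route differently (for example, with top-2 gating at each level). The point is only that each gate scores 4 options instead of one gate scoring all 16.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def route_two_level(outer_logits, inner_logits_per_group):
    """Top-1 routing through two gates: pick a group, then an expert inside it."""
    outer_probs = softmax(outer_logits)
    group = argmax(outer_probs)
    inner_probs = softmax(inner_logits_per_group[group])
    expert = argmax(inner_probs)
    # The token's output would be scaled by the product of both gate weights.
    weight = outer_probs[group] * inner_probs[expert]
    return group, expert, weight


# Hypothetical gate scores for one token: 4 groups of 4 experts (16 in total).
outer_logits = [0.1, 2.0, -1.0, 0.3]
inner_logits_per_group = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 3.0, -2.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

group, expert, weight = route_two_level(outer_logits, inner_logits_per_group)
flat_index = group * len(inner_logits_per_group[group]) + expert
print(group, expert, flat_index)  # 1 2 6
```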

Key highlights

Top-2 gating with train- and eval-specific policies for whether to route to a second-place expert (always, never, threshold, or random).
Two-level HeirarchicalMoE for scaling expert counts beyond what a flat gate can comfortably manage.
Pluggable custom expert networks via the experts argument, so you are not locked into a default MLP shape.
Exposed knobs for capacity factors and an auxiliary balancing loss to prevent the gate from collapsing to one favorite expert.
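
The knobs in this list can be illustrated with a small framework-free sketch. Everything here is an assumption made for illustration: the policy names follow the wording above rather than the exact strings the library accepts, the capacity formula is one common formulation, and the balancing loss is a widely used form (number of experts times the sum, over experts, of token fraction times mean gate probability) that may not match the library's implementation term for term.

```python
import random


def use_second_expert(gate_value, policy, threshold=0.2, rng=random):
    """Decide whether a token is also sent to its second-choice expert.

    gate_value is the gate weight assigned to the second-place expert.
    """
    if policy == "always":
        return True
    if policy == "never":
        return False
    if policy == "threshold":
        return gate_value > threshold
    if policy == "random":
        # Compare against a randomly scaled threshold, so weak second
        # choices are dropped more often than strong ones.
        return gate_value > threshold * rng.random()
    raise ValueError(f"unknown policy: {policy!r}")


def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Maximum number of tokens one expert accepts per batch (assumed formula)."""
    return int(num_tokens * capacity_factor / num_experts)


def balancing_loss(token_fraction, mean_gate_prob):
    """Auxiliary loss that is smallest when tokens are spread evenly."""
    num_experts = len(token_fraction)
    return num_experts * sum(f * p for f, p in zip(token_fraction, mean_gate_prob))


print(expert_capacity(num_tokens=1024, num_experts=16, capacity_factor=1.25))  # 80

balanced = balancing_loss([0.25] * 4, [0.25] * 4)
collapsed = balancing_loss([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
print(balanced, collapsed)  # 1.0 4.0
```

In this sketch a gate that sends every token to one expert scores four times worse than a perfectly balanced gate, which is the pressure the auxiliary loss is meant to apply during training.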

Caveats

The author explicitly recommends using the newer st-moe-pytorch repository instead of this one for new work.
The code is largely a direct transcription of the TensorFlow reference implementation rather than a ground-up redesign.

Verdict

Grab this if you need a faithful PyTorch port of the classic sparse MoE layer for an existing codebase. If you are starting fresh, the author suggests reaching for the successor st-moe-pytorch instead.
