The SAT for language models, written in 57 subjects
A benchmark that tests whether your LLM actually knows anything, or just sounds confident.

What it does
MMLU (Massive Multitask Language Understanding) is a multiple-choice benchmark covering 57 subjects, from professional law to elementary mathematics, grouped into four broad categories: humanities, social sciences, STEM, and “other.” The repo provides evaluation code and a download link for the test data; the README also hosts a public leaderboard showing how major models score across categories.
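To make the format concrete, here is a minimal sketch of how a four-option question might be rendered into a few-shot prompt. Everything in it is an assumption for illustration: the `Item` fields, the `format_item` and `build_few_shot_prompt` helpers, and the "Answer:" layout are hypothetical rather than taken from the repo, and the sample questions are invented, not drawn from the test.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

CHOICE_LABELS = ("A", "B", "C", "D")


@dataclass(frozen=True)
class Item:
    """One four-option question. Field names are illustrative, not the repo's."""
    subject: str
    question: str
    choices: Tuple[str, str, str, str]
    answer: str  # one of CHOICE_LABELS


def format_item(item: Item, include_answer: bool = False) -> str:
    """Render a question, its lettered options, and an answer slot."""
    lines = [item.question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICE_LABELS, item.choices)]
    lines.append(f"Answer: {item.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(shots: Sequence[Item], target: Item) -> str:
    """Solved examples first, then the unanswered target question."""
    blocks = [format_item(shot, include_answer=True) for shot in shots]
    blocks.append(format_item(target))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Invented questions, used only to show the layout.
    shot = Item("elementary_mathematics", "What is 7 + 5?", ("10", "11", "12", "13"), "C")
    target = Item("elementary_mathematics", "What is 9 - 4?", ("3", "4", "5", "6"), "C")
    print(build_few_shot_prompt([shot], target))
```

The prompt ends on a bare "Answer:" so that the model only has to produce a single letter, which is what makes the benchmark cheap to grade.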
The interesting bit
The leaderboard quietly reveals that scale isn’t destiny. Gopher (280B) trails Chinchilla (70B), and fine-tuned GPT-3 beats few-shot GPT-3 of the same size by ten points. The test also draws from the earlier ETHICS dataset, which suggests the authors care about what we want models to know, not just what they can memorize.
Key highlights
- 57 subject areas, four broad categories, all multiple-choice — easy to score, hard to game
- Leaderboard includes results from GPT-2 through Chinchilla with clear few-shot vs. fine-tuned distinctions
- Evaluation code targets the OpenAI API, so out of the box it is geared toward API-served models
- Dataset and paper date to 2021; leaderboard updated with 2022 models (Chinchilla, Flan-T5)
- Asks users to cite both MMLU and the ETHICS dataset it draws from
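The "easy to score" point can be sketched in a few lines: grading is exact match on a letter, then accuracy per subject or per category. This is a sketch under stated assumptions, not the repo's scoring harness. The record layout, the `accuracy_by` helper, and the two-entry `CATEGORY_OF` map are hypothetical, and the predictions are made up.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable

# Hypothetical two-entry subset of a subject-to-category map, for illustration only.
CATEGORY_OF = {
    "professional_law": "humanities",
    "elementary_mathematics": "STEM",
}

RANDOM_BASELINE = 1 / 4  # four options per question, so guessing scores 25%


def accuracy_by(records: Iterable[dict], key: Callable[[dict], str]) -> Dict[str, float]:
    """Fraction of exact-match answers within each group chosen by `key`."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        hits[group] += int(rec["predicted"] == rec["answer"])
    return {group: hits[group] / totals[group] for group in totals}


# Made-up predictions, not real model output.
records = [
    {"subject": "professional_law", "predicted": "A", "answer": "A"},
    {"subject": "professional_law", "predicted": "B", "answer": "C"},
    {"subject": "elementary_mathematics", "predicted": "D", "answer": "D"},
    {"subject": "elementary_mathematics", "predicted": "A", "answer": "B"},
    {"subject": "elementary_mathematics", "predicted": "C", "answer": "C"},
    {"subject": "elementary_mathematics", "predicted": "B", "answer": "B"},
]

per_subject = accuracy_by(records, key=lambda r: r["subject"])
per_category = accuracy_by(records, key=lambda r: CATEGORY_OF[r["subject"]])
print(per_subject)   # {'professional_law': 0.5, 'elementary_mathematics': 0.75}
print(per_category)  # {'humanities': 0.5, 'STEM': 0.75}
```

Whether a category score weights every question equally, as here, or averages per-subject accuracies is a design choice; this sketch does not claim to match how the leaderboard figures were computed.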
Caveats
- The repo itself is minimal: evaluation code plus a leaderboard, with actual test data hosted externally
- No code visible for the benchmark construction or subject taxonomy — just the scoring harness
Verdict
Grab this if you’re benchmarking LLMs and need a standardized, widely-cited baseline. Skip it if you’re looking for training data or novel evaluation methodology; this is a test, not a framework.