hendrycks/test

The SAT for language models, spanning 57 subjects

A benchmark that tests whether your LLM actually knows anything, or just sounds confident.

★1.6k stars Python LLMOps · Eval Language Models

What it does

The repo hosts MMLU (Massive Multitask Language Understanding), a multiple-choice benchmark spanning humanities, social sciences, STEM, and “other”: 57 subjects in total, from professional law to elementary mathematics. It provides evaluation code and a download link for the test data; the README also hosts a public leaderboard showing how major models score across categories.
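
To make the multiple-choice format concrete, here is a minimal sketch of how one question might be rendered into a few-shot prompt. The header wording, the four-letter labels, and the helper names format_question and build_prompt are assumptions for illustration, not code copied from the repo.

```python
CHOICE_LABELS = ["A", "B", "C", "D"]


def format_question(question, choices, answer=None):
    """Render one question. The answer letter is filled in only for
    solved few-shot examples; the target question ends with a bare 'Answer:'."""
    lines = [question]
    for label, choice in zip(CHOICE_LABELS, choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(subject, examples, target):
    """Join a subject header, the solved examples, and the unsolved target.

    examples: list of (question, choices, answer_letter) tuples
    target:   (question, choices) tuple for the question to be answered
    """
    # Header wording is an assumption about a typical few-shot prompt.
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}."
    )
    shots = [format_question(q, c, a) for q, c, a in examples]
    return "\n\n".join([header, *shots, format_question(*target)])


# Hypothetical one-shot prompt with made-up questions.
prompt = build_prompt(
    "elementary mathematics",
    [("What is 2 + 2?", ["3", "4", "5", "6"], "B")],
    ("What is 3 * 3?", ["6", "7", "8", "9"]),
)
print(prompt)
```

The model's reply is then compared against the gold letter, which is what makes this style of benchmark cheap to grade.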

The interesting bit

The leaderboard quietly reveals that scale isn’t destiny. Gopher (280B) trails Chinchilla (70B), and fine-tuned GPT-3 beats few-shot GPT-3 of the same size by ten percentage points of accuracy. The test also draws from an earlier ETHICS dataset, which suggests the authors have been thinking about what we want models to know, not just what they can memorize.

Key highlights

57 subject areas, four broad categories, all multiple-choice — easy to score, hard to game
Leaderboard includes results from GPT-2 through Chinchilla with clear few-shot vs. fine-tuned distinctions
Evaluation code targets the OpenAI API, so it’s built for production-model testing
Paper published at ICLR 2021; leaderboard updated with 2022 models (Chinchilla, Flan-T5)
Requires citation of both MMLU and the underlying ETHICS dataset
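
As a rough illustration of why multiple-choice results are easy to score, the sketch below computes per-subject accuracy plus two overall averages from predicted and gold answer letters. The record layout and the function name score are hypothetical, and the repo's own harness may aggregate differently.

```python
from collections import defaultdict


def score(records):
    """Compute accuracy from (subject, predicted_letter, gold_letter) records.

    Returns (per_subject, micro, macro):
      per_subject: accuracy for each subject
      micro:       accuracy over all questions pooled together
      macro:       unweighted mean of the per-subject accuracies
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        # Compare letters case-insensitively and ignore stray whitespace.
        if predicted.strip().upper() == gold.strip().upper():
            correct[subject] += 1
    if not total:
        raise ValueError("no records to score")
    per_subject = {s: correct[s] / total[s] for s in total}
    micro = sum(correct.values()) / sum(total.values())
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, micro, macro


# Made-up predictions for two subjects.
records = [
    ("professional_law", "A", "A"),
    ("professional_law", "B", "C"),
    ("professional_law", "D", "D"),
    ("professional_law", "c", "B"),
    ("elementary_mathematics", "b", "B"),
]
per_subject, micro, macro = score(records)
```

The micro and macro averages diverge when subjects have different numbers of questions, so it is worth stating which one a reported score uses.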

Caveats

The repo itself is minimal: evaluation code plus a leaderboard, with actual test data hosted externally
No code visible for the benchmark construction or subject taxonomy — just the scoring harness

Verdict

Grab this if you’re benchmarking LLMs and need a standardized, widely-cited baseline. Skip it if you’re looking for training data or novel evaluation methodology; this is a test, not a framework.

Frequently asked

What is hendrycks/test? A benchmark that tests whether your LLM actually knows anything, or just sounds confident.
Is test open source? Yes. hendrycks/test is open source, released under the MIT license.
What language is test written in? hendrycks/test is primarily written in Python.
How popular is test? hendrycks/test has 1.6k stars on GitHub.
Where can I find test? hendrycks/test is on GitHub at https://github.com/hendrycks/test.