EleutherAI/lm-evaluation-harness

The de facto standard for benchmarking LLMs, warts and all

A framework that turned "how good is this model really?" from a weekend science project into a one-line command — and became the backend for Hugging Face's Open LLM Leaderboard along the way.

★ 13.4k stars · Python · LLMOps · Eval

Velocity (7d): +12 ★ / day · Trend: accelerating

What it does

lm-evaluation-harness is a unified framework for few-shot evaluation of generative language models. It bundles 60+ standard academic benchmarks with hundreds of subtasks, exposes them through a consistent CLI and Python API, and supports a laundry list of backends: Hugging Face Transformers, vLLM, SGLang, GPT-NeoX, Megatron-DeepSpeed, and commercial APIs like OpenAI. It also handles the fiddly bits: quantization via GPTQ/AutoGPTQ, LoRA adapters through PEFT, GGUF models, multi-GPU via Accelerate, and even a prototype multimodal path.
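As a rough sketch of the "consistent CLI" described above, the snippet below assembles an `lm_eval` invocation as an argument list without running it. The flag names (`--model`, `--model_args`, `--tasks`, `--num_fewshot`, `--batch_size`) follow the project's README as I understand it and may differ between releases. The checkpoint and task are placeholders, and `build_eval_command` is a hypothetical helper, not part of the harness.

```python
import shlex


def build_eval_command(model, model_args, tasks, num_fewshot=0, batch_size="auto"):
    """Assemble an lm_eval CLI call as an argument list (hypothetical helper)."""
    # Backend options are passed as one comma-separated key=value string.
    model_args_str = ",".join(f"{key}={value}" for key, value in model_args.items())
    return [
        "lm_eval",
        "--model", model,
        "--model_args", model_args_str,
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
    ]


# Placeholder checkpoint and task, chosen only for illustration.
command = build_eval_command(
    model="hf",
    model_args={"pretrained": "EleutherAI/pythia-160m"},
    tasks=["hellaswag"],
    num_fewshot=5,
)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would launch the evaluation, assuming the harness and the `hf` backend are installed.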

The interesting bit

The project became infrastructure by accident. It is the backend for Hugging Face’s Open LLM Leaderboard, has been cited in hundreds of papers, and is used internally by NVIDIA, Cohere, BigScience, and others. The maintainers recently decoupled the base install from heavy dependencies like transformers and torch — you now install backends à la carte (lm_eval[hf], lm_eval[vllm], etc.) — which suggests the project has matured from research tool to something resembling a platform.
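One practical consequence of the à la carte install is that a backend may simply be absent at run time. The sketch below checks which backends look importable before an evaluation is launched. The extra names `hf` and `vllm` come from the text above; the mapping from each extra to a module name is an assumption, and `available_backends` is a hypothetical helper.

```python
import importlib.util

# Assumed mapping from install extra to the module that extra should provide.
# The extra names come from the text above; the module names are assumptions.
BACKEND_MODULES = {
    "hf": "transformers",
    "vllm": "vllm",
}


def available_backends(backend_modules=BACKEND_MODULES):
    """Return {extra_name: bool} for whether each backend module is importable."""
    return {
        extra: importlib.util.find_spec(module) is not None
        for extra, module in backend_modules.items()
    }


for extra, installed in available_backends().items():
    hint = "installed" if installed else f"missing, try: pip install 'lm_eval[{extra}]'"
    print(f"{extra}: {hint}")
```

`find_spec` only looks the module up without importing it, so the check stays cheap even for heavy packages.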

Key highlights

60+ benchmarks, hundreds of subtasks, all with publicly available prompts for reproducibility
Tokenization-agnostic interface that abstracts over wildly different model architectures
Jinja2-based prompt design with imports from Promptsource, plus YAML-configurable tasks
Automatic batch-size detection (--batch_size auto, or auto:4 to recompute the largest batch size that fits four times over a run) and data-parallel multi-GPU evaluation
Support for stripping chain-of-thought reasoning traces via think_end_token on recent releases
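To make the last highlight concrete, here is an illustrative sketch of what stripping a reasoning trace amounts to: everything up to and including the end-of-thinking marker is discarded before the answer is scored. This is not the harness's implementation. `strip_reasoning` is a hypothetical helper, and `</think>` is an assumed marker value that varies by model.

```python
def strip_reasoning(generation: str, think_end_token: str = "</think>") -> str:
    """Drop everything up to and including the last end-of-thinking marker."""
    # rpartition splits on the last occurrence of the marker. If the marker
    # is absent, the whole string lands in the third element, so the text
    # is kept (apart from the whitespace trimming below).
    _, _, answer = generation.rpartition(think_end_token)
    return answer.strip()


raw = "<think>2 + 2 is 4, so the answer is B.</think>\nB"
print(strip_reasoning(raw))
```

Splitting on the last marker rather than the first is a design choice for this sketch: it keeps only the text after the final reasoning block if a model emits more than one.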

Caveats

Multimodal support is explicitly prototyped and incomplete; the README nudges users toward the fork lmms-eval instead
GGUF evaluation carries a sharp edge: omit a separate tokenizer path and Hugging Face will try to reconstruct the tokenizer from the GGUF file, which can take hours or hang indefinitely
Multi-node evaluation is noted as not natively supported, though the README truncates before elaborating
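The GGUF caveat above is easy to guard against in a wrapper script. The sketch below builds a `--model_args` string for a GGUF model and refuses to proceed without a tokenizer path. The key names `pretrained`, `gguf_file`, and `tokenizer` reflect my reading of the README and should be verified against the installed release. `gguf_model_args` is a hypothetical helper, and all paths are placeholders.

```python
def gguf_model_args(gguf_dir, gguf_file, tokenizer=None):
    """Build a --model_args string for a GGUF model, insisting on a tokenizer path."""
    if not tokenizer:
        # Without an explicit tokenizer the loader may try to rebuild one
        # from the GGUF binary, which is the slow path described above.
        raise ValueError("pass a separate tokenizer path for GGUF models")
    return f"pretrained={gguf_dir},gguf_file={gguf_file},tokenizer={tokenizer}"


# All three paths below are placeholders.
print(gguf_model_args("/path/to/gguf_folder", "model-name.gguf", "/path/to/tokenizer"))
```

Failing fast here costs nothing and avoids discovering the problem hours into a stalled job.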

Verdict

If you are publishing LLM results and want reviewers to believe them, you should probably be using this or explaining why not. If you are building a narrow, custom evaluation pipeline for a single model and task, the abstraction overhead may not pay for itself.

Frequently asked

What is EleutherAI/lm-evaluation-harness? A framework that turned "how good is this model really?" from a weekend science project into a one-line command, and became the backend for Hugging Face's Open LLM Leaderboard along the way.
Is lm-evaluation-harness open source? Yes. EleutherAI/lm-evaluation-harness is open source, released under the MIT license.
What language is lm-evaluation-harness written in? EleutherAI/lm-evaluation-harness is primarily written in Python.
How popular is lm-evaluation-harness? EleutherAI/lm-evaluation-harness has 13.4k stars on GitHub, and its star growth is currently accelerating (about +12 stars per day over the last 7 days).
Where can I find lm-evaluation-harness? EleutherAI/lm-evaluation-harness is on GitHub at https://github.com/EleutherAI/lm-evaluation-harness.