EleutherAI/lm-evaluation-harness
A Python framework for few-shot evaluation of language models across standard benchmarks.

Velocity (7d): +6.1 ★/day · Trend: steady
The lm-evaluation-harness is a standardized framework for evaluating language models with few-shot prompting. It covers standard benchmarks and leaderboards, and supports Hugging Face Transformers, vLLM, and SGLang as model backends. Its goal is reproducible evaluation of model capabilities across tasks such as reasoning, question answering, and multimodal understanding.
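
As a rough illustration of how a few-shot run is typically configured, the sketch below assembles an `lm_eval` command line for one of the supported backends. The helper `build_lm_eval_command` is hypothetical (it is not part of the harness), the model and task names are placeholders, and the flags and backend identifiers (`hf`, `vllm`, `sglang`) are assumed from the project's documented CLI and may differ between versions.

```python
import shlex

# Backend identifiers assumed from the harness documentation; check your installed version.
BACKENDS = {"hf", "vllm", "sglang"}


def build_lm_eval_command(backend, pretrained, tasks, num_fewshot=0, batch_size="auto"):
    """Hypothetical helper: return an argv list for an lm_eval few-shot run."""
    if backend not in BACKENDS:
        raise ValueError(f"unsupported backend: {backend!r}")
    if not tasks:
        raise ValueError("at least one task is required")
    return [
        "lm_eval",
        "--model", backend,
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),           # tasks are passed comma-separated
        "--num_fewshot", str(num_fewshot),    # number of in-context examples
        "--batch_size", str(batch_size),
    ]


# Placeholder model and tasks, chosen only for illustration.
cmd = build_lm_eval_command(
    "hf", "EleutherAI/pythia-160m", ["hellaswag", "arc_easy"], num_fewshot=5
)
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting; switching backends should then only require changing the first argument, assuming the chosen backend accepts a `pretrained=` model argument.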