lmarena/arena-hard-auto

An automatic evaluation tool for instruction-tuned LLMs with the highest correlation to Chatbot Arena among open-ended benchmarks.

1k stars · Python · LLMOps · Eval · Velocity (7d): +1.1 ★/day · Trend: steady

Arena-Hard-Auto is a benchmark that automatically evaluates instruction-tuned LLMs. Among popular open-ended benchmarks, it achieves the highest correlation with LMArena (Chatbot Arena) and the highest separability, which makes it useful for predicting model performance before deployment. The project supports Style Control evaluation and includes Arena-Hard-v2.0, which adds improved judges, harder prompts, and creative writing assessment.
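
To make the two headline metrics concrete, here is a minimal sketch of how correlation and separability could be computed for a set of models. It is not taken from the arena-hard-auto codebase: it assumes correlation is measured as Spearman rank correlation between benchmark scores and arena ratings, and that separability is the fraction of model pairs whose confidence intervals do not overlap. The function names and all numbers are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks (1 = smallest). Assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data of equal length."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def separability(intervals):
    """Fraction of model pairs whose (low, high) intervals do not overlap."""
    pairs = list(combinations(intervals, 2))
    separated = sum(
        1 for (lo1, hi1), (lo2, hi2) in pairs if hi1 < lo2 or hi2 < lo1
    )
    return separated / len(pairs)


# Hypothetical toy numbers for four models, not real benchmark results.
benchmark_scores = [82.0, 61.5, 47.0, 30.2]
arena_ratings = [1250, 1190, 1200, 1100]
benchmark_cis = [(79.0, 85.0), (58.0, 65.0), (44.0, 50.0), (28.0, 48.0)]

print("correlation:", spearman(benchmark_scores, arena_ratings))
print("separability:", separability(benchmark_cis))
```

A benchmark that scores well on both measures ranks models in roughly the same order as the arena and can also tell most pairs of models apart with statistical confidence.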
