lmarena/arena-hard-auto
An automatic evaluation tool for instruction-tuned LLMs with the highest correlation to Chatbot Arena among open-ended benchmarks.

Arena-Hard-Auto is an automated benchmark for instruction-tuned LLMs. Among popular open-ended benchmarks, it achieves the highest correlation with LMArena (Chatbot Arena) rankings and the highest separability between models, which makes it useful for predicting model performance before deployment. The project supports Style Control evaluation and includes Arena-Hard-v2.0, which adds improved judges, harder prompts, and creative writing assessment.
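
To make the two quality measures concrete, here is a minimal sketch of how correlation and separability might be computed for a set of models. It assumes correlation means Spearman rank correlation between benchmark scores and arena ratings, and separability means the fraction of model pairs whose score confidence intervals do not overlap. The function names and all numbers are hypothetical illustrations, not the project's code or results.

```python
from itertools import combinations


def rank(values):
    """Rank values from 1 (smallest) upward. Assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    rank_x, rank_y = rank(xs), rank(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def separability(intervals):
    """Fraction of model pairs whose (low, high) intervals do not overlap."""
    pairs = list(combinations(intervals, 2))
    separated = sum(
        1 for (lo1, hi1), (lo2, hi2) in pairs if hi1 < lo2 or hi2 < lo1
    )
    return separated / len(pairs)


# Hypothetical data for four models (made up for illustration only).
benchmark_scores = [82.0, 61.5, 47.3, 30.1]
arena_ratings = [1250, 1190, 1200, 1100]
score_intervals = [(80, 84), (58, 65), (45, 60), (28, 32)]

correlation = spearman(benchmark_scores, arena_ratings)
separation = separability(score_intervals)
print(f"correlation={correlation:.2f} separability={separation:.2f}")
```

In this toy data the second and third models swap places between the two rankings and their intervals overlap, so both measures fall below 1. A benchmark that tracks the arena well and distinguishes models confidently would push both values toward 1.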