arthur-ai/bench
A tool for evaluating and benchmarking LLMs against production use cases.

Velocity (7d): +0.4 ★/day · Trend: steady
Bench is a Python library for evaluating LLMs in production contexts. It provides a standardized interface for running test suites against different LLMs, comparing prompt variations, and assessing how generation hyperparameters such as temperature and token count affect output. Users create test suites with reference outputs and score candidate model responses against them, which enables side-by-side comparison of open-source and closed-source LLMs.
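The test-suite workflow described above can be illustrated with a small, library-free sketch. This is not Bench's actual API: `MiniSuite`, `exact_match`, and the sample data are hypothetical names made up for illustration, and exact-match scoring is just one assumed way to compare a candidate response with a reference output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def exact_match(candidate: str, reference: str) -> float:
    """Hypothetical scorer: 1.0 if the stripped strings are identical, else 0.0."""
    return 1.0 if candidate.strip() == reference.strip() else 0.0


@dataclass
class MiniSuite:
    """Illustrative stand-in for a test suite: inputs paired with reference outputs."""
    inputs: List[str]
    references: List[str]
    scorer: Callable[[str, str], float] = exact_match

    def run(self, candidates: List[str]) -> float:
        """Score one model's candidate outputs and return the mean score."""
        if not self.references:
            raise ValueError("suite has no reference outputs")
        if len(candidates) != len(self.references):
            raise ValueError("need exactly one candidate per reference output")
        scores = [self.scorer(c, r) for c, r in zip(candidates, self.references)]
        return sum(scores) / len(scores)


# One suite, reused across models so the results are directly comparable.
suite = MiniSuite(
    inputs=["What is 2 + 2?", "What is the capital of France?"],
    references=["4", "Paris"],
)

# Made-up candidate outputs standing in for responses from two different LLMs.
results: Dict[str, float] = {
    "model_a": suite.run(["4", "Paris"]),
    "model_b": suite.run(["4", "Lyon"]),
}

for name, score in results.items():
    print(f"{name}: {score:.2f}")
```

Keeping the suite fixed and varying only the candidate outputs is what makes the comparison side by side: the same inputs, references, and scorer apply to every model, prompt variant, or hyperparameter setting being evaluated.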