carlini/yet-another-applied-llm-benchmark
A practical benchmark for evaluating language models on real-world tasks derived from the author's own LLM usage.

Velocity (7d): +1.2 ★ / day · Trend: steady
A benchmark framework that tests LLMs on diverse applied tasks such as code generation, parsing, decompilation, and format identification. It uses a custom dataflow DSL in which tests chain operations via the >> operator, combining LLM execution, code running (in Docker containers), and output evaluation. It includes nearly 100 test cases covering tasks the author has actually asked LLMs to perform.
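To make the chaining idea concrete, here is a minimal sketch of how a >> based dataflow pipeline could be built in Python. It is not the repository's code: the names Stage, Chain, FakeLLM, ExtractCode, RunPython, and SubstringCheck are hypothetical, the model call is replaced by a canned reply, and the generated code runs in-process with exec rather than in a Docker container.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # markdown code fence, built here to avoid nesting fences


class Stage:
    """One pipeline step; subclasses override run()."""

    def run(self, value):
        raise NotImplementedError

    def __rshift__(self, other):
        # a >> b yields a new stage that feeds a's output into b
        return Chain(self, other)

    def __call__(self, value):
        return self.run(value)


class Chain(Stage):
    """Two stages composed in sequence."""

    def __init__(self, first, second):
        self.first, self.second = first, second

    def run(self, value):
        return self.second.run(self.first.run(value))


class FakeLLM(Stage):
    """Stand-in for a model call (assumption): returns a canned reply."""

    def run(self, prompt):
        return f"Sure:\n{FENCE}python\nprint('hello world')\n{FENCE}"


class ExtractCode(Stage):
    """Pull the first fenced code block out of a reply."""

    def run(self, text):
        match = re.search(FENCE + r"(?:\w+)?\n(.*?)" + FENCE, text, re.DOTALL)
        return match.group(1) if match else text


class RunPython(Stage):
    """Run code and capture stdout. Unsandboxed: for illustration only."""

    def run(self, code):
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
        return buffer.getvalue()


class SubstringCheck(Stage):
    """Pass if the expected text appears in the output."""

    def __init__(self, expected):
        self.expected = expected

    def run(self, output):
        return self.expected in output


pipeline = FakeLLM() >> ExtractCode() >> RunPython() >> SubstringCheck("hello world")
passed = pipeline("Write hello world in Python")
print("PASS" if passed else "FAIL")
```

Because >> is left-associative in Python, the expression builds nested Chain objects from left to right, so each stage only needs to know how to transform its own input. Swapping the evaluator or the runner then means replacing one stage without touching the rest of the test.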