
carlini/yet-another-applied-llm-benchmark

A practical benchmark for evaluating language models on real-world tasks derived from the author's own LLM usage.

1.1k stars · Python · LLMOps · Eval
Velocity (7d): +1.2 ★/day · Trend: steady

A benchmark framework that tests LLMs on diverse applied tasks such as code generation, parsing, decompilation, and format identification. Tests are written in a custom dataflow DSL that chains operations with the >> operator, combining LLM execution, code running (in Docker containers), and output evaluation. It includes nearly 100 test cases covering tasks the author has actually asked LLMs to perform.
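To make the chaining idea concrete, here is a minimal sketch of how a >> dataflow DSL can be built in Python with operator overloading. It is an illustration, not the repository's implementation: the names Node, Pipeline, fake_llm, run_python, and contains are all hypothetical, the "LLM" is a stub that returns canned code, and the code runs in-process through exec rather than in a Docker container.

```python
import contextlib
import io
from typing import Callable, List


class Node:
    """One pipeline stage (hypothetical name); `>>` chains stages left to right."""

    def __init__(self, fn: Callable, name: str):
        self.fn = fn
        self.name = name

    def __rshift__(self, other: "Node") -> "Pipeline":
        return Pipeline([self, other])

    def __rrshift__(self, prompt: str) -> "Pipeline":
        # Lets a plain string start a chain: "prompt" >> stage
        return Pipeline([Node(lambda _: prompt, "Prompt"), self])


class Pipeline:
    """An ordered list of stages; each stage's output feeds the next."""

    def __init__(self, stages: List[Node]):
        self.stages = stages

    def __rshift__(self, other: Node) -> "Pipeline":
        return Pipeline(self.stages + [other])

    def run(self, value=None):
        for stage in self.stages:
            value = stage.fn(value)
        return value


def fake_llm(prompt: str) -> str:
    # Assumption: stands in for a real model call; always returns canned code.
    return 'print("hello world")'


def run_python(code: str) -> str:
    # Assumption: in-process exec for illustration only. Untrusted model
    # output should be sandboxed (the benchmark uses Docker containers).
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()


def contains(needle: str) -> Node:
    # Evaluator stage: passes if the needle appears in the previous output.
    return Node(lambda out: needle in out, f"Contains({needle!r})")


test = (
    "Write hello world in Python"
    >> Node(fake_llm, "FakeLLM")
    >> Node(run_python, "RunPython")
    >> contains("hello world")
)
result = test.run()
print(result)
```

Because each stage only sees the previous stage's output, a test reads left to right as prompt, model call, execution, and check, and swapping in a different evaluator only changes the last link of the chain.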
