Is llm_benchmark open source?

Yes. llm2014/llm_benchmark is an open-source project tracked on heatdrop.

How popular is llm_benchmark?

llm2014/llm_benchmark has 1.5k stars on GitHub.

Where can I find llm_benchmark?

llm2014/llm_benchmark is on GitHub at https://github.com/llm2014/llm_benchmark.

llm2014/llm_benchmark

The LLM benchmark that fails models for being helpful

A personal, rolling evaluation that tracks how well large language models reason through private logic, math, and coding problems under strictly enforced output constraints.

★1.5k stars LLMOps · Eval

What it does

This repository hosts a long-running, personal LLM evaluation that pits models against a rotating set of roughly 28 handcrafted problems spanning logic, mathematics, Python programming, spatial reasoning, and instruction following. The author runs each model through its official API or a routing service (OpenRouter or Zenmux), grades the answers against a strict rubric, and publishes updated rankings monthly. It is deliberately modest in scope: a single observer’s lens on model evolution, not a definitive industry leaderboard.

The interesting bit

The questions are kept private and updated monthly to prevent data contamination, forcing models to reason rather than regurgitate memorized answers. Grading is unusually pedantic: a correct answer still scores zero if the model adds forbidden explanations, writes code when prohibited, or skips required derivation steps. That anti-cheese rigor is the entire point.
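The all-or-nothing rule described above can be sketched in a few lines. This is a minimal illustration, not the author's grader: the `Response` and `Rules` shapes, the field names, and the binary 0/1 scoring are all assumptions made for the example, and the real rubric may well award partial credit per test case.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """A model reply, pre-parsed into the parts a grader inspects (hypothetical shape)."""
    final_answer: str
    has_explanation: bool = False  # extra prose beyond the requested output
    has_code: bool = False         # the reply contains code
    has_derivation: bool = False   # the reply shows its working steps


@dataclass
class Rules:
    """Output constraints attached to one question (hypothetical shape)."""
    forbid_explanation: bool = False
    forbid_code: bool = False
    require_derivation: bool = False


def grade(response: Response, expected: str, rules: Rules) -> float:
    """Return 1.0 only if the answer is right AND every constraint is met."""
    violated = (
        (rules.forbid_explanation and response.has_explanation)
        or (rules.forbid_code and response.has_code)
        or (rules.require_derivation and not response.has_derivation)
    )
    if violated:
        return 0.0  # a correct answer does not rescue a rule violation
    return 1.0 if response.final_answer.strip() == expected.strip() else 0.0


rules = Rules(forbid_explanation=True)
print(grade(Response("42"), "42", rules))                        # 1.0
print(grade(Response("42", has_explanation=True), "42", rules))  # 0.0
```

The constraint check runs before the answer check, which is what makes a correct but chatty reply score the same as a wrong one.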

Key highlights

Private, rolling question bank (~28 problems, ~270 cases), none of it drawn from public internet sources
Heavy emphasis on logic, math, coding, spatial reasoning, and strict instruction following
Pedantic scoring: violating output format rules or omitting derivation steps earns a zero, even if the final answer is correct
Each model tested three times per question via official APIs or OpenRouter/Zenmux; highest score retained
Scores archived monthly, with an acknowledged variance of about ±4 points caused by question rotation
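The "three attempts, highest score retained" rule from the highlights amounts to a best-of-n reduction. The sketch below assumes a hypothetical `run_once` callable that sends one question to a model and returns the graded score; how the benchmark actually invokes models is not described in the source.

```python
from typing import Callable


def best_of_n(run_once: Callable[[], float], attempts: int = 3) -> float:
    """Call run_once `attempts` times and keep the highest score."""
    return max(run_once() for _ in range(attempts))


# Stand-in for three real API calls: scripted scores for one question.
scripted = iter([0.0, 1.0, 0.5])
print(best_of_n(lambda: next(scripted)))  # 1.0
```

Keeping the maximum rather than the mean rewards a model for what it can do on a good sample, so the resulting scores are best read as an upper bound on typical single-shot behavior.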

Caveats

The author explicitly warns the benchmark is neither authoritative nor comprehensive
Small question set means monthly score swings of roughly ±4 points are normal
Questions are not public, so independent reproduction or verification is impossible
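A rough calculation shows why swings of this size are unsurprising with so few problems. It assumes a 100-point scale and equally weighted problems; the source states neither, so treat the numbers as an illustration only.

```python
def points_per_question(num_questions: int, scale: float = 100.0) -> float:
    """Share of the total score carried by one equally weighted question (assumed scheme)."""
    return scale / num_questions


print(round(points_per_question(28), 2))  # 3.57
```

Under those assumptions, a rotation that swaps a single problem a model solved for one it fails moves the total by about 3.6 points, which is the same order as the acknowledged ±4.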

Verdict

Check the rankings if you want a contamination-resistant, reasoning-heavy sanity check on LLM progress, particularly for Chinese models, but look elsewhere if you need a large-scale, peer-reviewed evaluation suite.
