The benchmark that killed itself by winning
A coding benchmark evolved into a reasoning stress-test that auto-scales difficulty so models can never truly "pass."

What it does
Can-AI-Code started as a simple question: can LLMs write valid code? Two years later, the answer is “obviously yes,” so the project pivoted hard. It’s now a self-scaling reasoning benchmark that generates unlimited unique problems across two difficulty axes—length (working memory stress) and depth (structural complexity)—then measures how far each model climbs before failing.
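To make the two axes concrete, here is a minimal sketch of what a parametric generator could look like. It is not the project's code (the generators are unreleased): the names `make_expr` and `make_problem`, the choice of boolean expressions as the task, and the 30% negation rate are all assumptions for illustration. The `length` parameter controls how many operands sit at each level and `depth` controls how deeply they nest.

```python
import random


def make_expr(rng, depth, length):
    """Build a random boolean expression string (hypothetical generator).

    depth  -> nesting levels (structural complexity)
    length -> operands per level (working-memory load)
    """
    if depth == 0:
        return rng.choice(["True", "False"])
    parts = []
    for _ in range(length):
        sub = make_expr(rng, depth - 1, length)
        # Assumed 30% chance of negating a sub-expression.
        if rng.random() < 0.3:
            sub = f"(not {sub})"
        parts.append(sub)
    op = rng.choice([" and ", " or "])
    return "(" + op.join(parts) + ")"


def make_problem(seed, depth, length):
    """Return (expression, ground-truth answer) for a given seed."""
    rng = random.Random(seed)
    expr = make_expr(rng, depth, length)
    # The string contains only literals and boolean operators that this
    # generator produced, so evaluating it gives the reference answer.
    answer = eval(expr)
    return expr, answer


if __name__ == "__main__":
    expr, answer = make_problem(seed=7, depth=2, length=3)
    print(expr, "->", answer)
```

Because each problem is a pure function of `(seed, depth, length)`, a new seed gives a new problem at the same difficulty, which is the property that makes memorizing a fixed test set pointless.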
The interesting bit
The author ran 200+ million tokens through consumer RTX 3090s in his basement (blowing breakers in the process) and found that models have distinct “cognitive fingerprints.” OpenAI’s models crush boolean logic but choke on tokenization; Qwen’s smaller models get 250% boosts from extra thinking time; Llama is the balanced generalist. The benchmark auto-toughens when models cluster above 90% accuracy, so in theory it can’t go stale.
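The auto-toughening rule can be sketched in a few lines. This is one possible reading, not the author's implementation: it assumes "cluster above 90%" means every model's accuracy exceeds the threshold, and the function name `next_difficulty` and its `step` parameter are hypothetical.

```python
def next_difficulty(level, accuracies, threshold=0.90, step=1):
    """Return the difficulty level for the next round (hypothetical rule).

    Assumption: the benchmark gets harder only when *every* model's
    accuracy at the current level is above the threshold.
    """
    if accuracies and min(accuracies) > threshold:
        return level + step
    return level


if __name__ == "__main__":
    print(next_difficulty(3, [0.95, 0.92]))  # all models above 90%
    print(next_difficulty(3, [0.95, 0.60]))  # one model still struggling
```

A rule like this also shows where the open question lies: what counts as one "step" of difficulty is exactly the part the unreleased generators would have to define.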
Key highlights
- Parametric generators create infinite unique problems—no memorization, no fixed test sets
- Measures three things: height (max difficulty reached), efficiency (tokens burned), and constrained performance (limited resources)
- Identified working memory as the “universal bottleneck” and tokenization as a persistent Achilles heel across nearly all models
- Framework is domain-agnostic; author plans spatial reasoning, causal inference, and creative synthesis next
- Consumer-hardware research: two RTX 3090s, blown breakers, and a lot of curiosity
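For readers who want a feel for the height and efficiency measurements, here is a rough sketch under stated assumptions. The README gives no formulas, so the `Run` record, the 50% pass rate that defines "reached," and tokens-per-correct-answer as the efficiency measure are all hypothetical choices. Constrained performance is left out because the source does not say how resources are limited.

```python
from dataclasses import dataclass


@dataclass
class Run:
    difficulty: int  # difficulty level of the problem
    correct: bool    # whether the model answered correctly
    tokens: int      # tokens the model generated for this problem


def height(runs, pass_rate=0.5):
    """Highest difficulty where accuracy is at least pass_rate (assumed definition)."""
    by_level = {}
    for r in runs:
        by_level.setdefault(r.difficulty, []).append(r.correct)
    passed = [lvl for lvl, res in by_level.items()
              if sum(res) / len(res) >= pass_rate]
    return max(passed, default=0)


def tokens_per_correct(runs):
    """Total tokens spent divided by correct answers (assumed efficiency measure)."""
    correct = sum(r.correct for r in runs)
    total = sum(r.tokens for r in runs)
    return total / correct if correct else float("inf")


if __name__ == "__main__":
    runs = [
        Run(1, True, 100), Run(1, True, 120),
        Run(2, True, 200), Run(2, False, 300),
        Run(3, False, 400),
    ]
    print(height(runs), tokens_per_correct(runs))
```

Separating the two numbers matters for claims like the Qwen thinking-time boost: a model can climb higher while spending far more tokens to get there.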
Caveats
- The new “Can-AI-Think” benchmark suite is described as “available soon”—the README is essentially a pre-release announcement
- Results cited (80% boolean logic accuracy, 250% Qwen boost) lack methodological detail; replication would require the unreleased generators
- The auto-scaling difficulty mechanism sounds elegant but is untested at scale—no evidence yet that it won’t create its own ceiling
Verdict
Worth watching if you benchmark models or study reasoning architectures. Skip it if you need something you can run today—the new suite isn’t out yet, and the original coding benchmark is explicitly retired. The real signal here is the framework design, not the current repo contents.