petergpt/bullshit-benchmark
A benchmark suite evaluating how well AI models detect nonsense prompts and avoid confidently continuing with invalid assumptions.

BullshitBench measures AI model robustness by presenting nonsensical prompts and scoring each response: the model either detects and rejects the nonsense or incorrectly accepts it as valid. Version 2 contains 100 questions across five domains: software, finance, legal, medical, and physics. The project provides a public viewer, leaderboards tracking model performance, and analyses of how factors such as reasoning effort, model size, and launch date affect detection accuracy.
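
As a rough illustration of the scoring idea described above, the sketch below computes a per-domain detection rate from a list of graded responses. It is not the project's actual scoring code: the two labels (`detected`, `accepted`), the function name `detection_rate_by_domain`, and the sample data are all assumptions made for this example, and the real benchmark may use a different or finer-grained rubric.

```python
from collections import defaultdict

# Hypothetical grading labels, assumed for this sketch only.
DETECTED = "detected"   # the model flagged or rejected the nonsense
ACCEPTED = "accepted"   # the model treated the nonsense as valid


def detection_rate_by_domain(results):
    """Return the fraction of detected prompts per domain.

    `results` is an iterable of (domain, label) pairs, one per graded response.
    """
    totals = defaultdict(int)
    detected = defaultdict(int)
    for domain, label in results:
        if label not in (DETECTED, ACCEPTED):
            raise ValueError(f"unknown label: {label!r}")
        totals[domain] += 1
        if label == DETECTED:
            detected[domain] += 1
    return {domain: detected[domain] / totals[domain] for domain in totals}


# Made-up sample data, not real benchmark results.
sample = [
    ("software", DETECTED),
    ("software", ACCEPTED),
    ("finance", DETECTED),
    ("medical", ACCEPTED),
]
rates = detection_rate_by_domain(sample)
```

Grouping by domain keeps the five domains comparable even if they contain different numbers of questions. An overall score could be taken as the mean of these per-domain rates or as a single pooled fraction, and the two can differ when domain sizes are unequal.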