
petergpt/bullshit-benchmark

A benchmark suite evaluating how well AI models detect nonsense prompts and avoid confidently continuing with invalid assumptions.

1.7k stars · Python · LLMOps · Eval · Language Models
Velocity (7d): +16 ★ / day · Trend: steady

BullshitBench measures AI model robustness by presenting nonsensical prompts and scoring whether each response detects and rejects the nonsense or incorrectly accepts it as valid. Version 2 contains 100 questions across five domains: software, finance, legal, medical, and physics. The project provides a public viewer, leaderboards tracking model performance, and analyses of how reasoning effort, model size, and launch date affect detection accuracy.
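As a rough illustration of the scoring idea described above, the sketch below computes a per-domain detection rate from graded responses. It is not taken from the repository: the label names `detected` and `accepted`, the function `detection_rate_by_domain`, and the sample data are all hypothetical, and the benchmark's actual grading scheme may be more fine-grained.

```python
from collections import defaultdict

# Hypothetical labels; the benchmark's real grading categories may differ.
DETECTED = "detected"   # the model flagged the prompt as nonsense
ACCEPTED = "accepted"   # the model answered as if the prompt were valid


def detection_rate_by_domain(results):
    """Return {domain: fraction of responses labeled DETECTED}.

    `results` is an iterable of (domain, label) pairs.
    """
    totals = defaultdict(int)
    detected = defaultdict(int)
    for domain, label in results:
        totals[domain] += 1
        if label == DETECTED:
            detected[domain] += 1
    return {domain: detected[domain] / totals[domain] for domain in totals}


# Made-up sample data, for illustration only.
sample = [
    ("software", DETECTED),
    ("software", ACCEPTED),
    ("finance", DETECTED),
    ("medical", ACCEPTED),
]
rates = detection_rate_by_domain(sample)
print(rates)
```

Grouping by domain mirrors the five-domain split mentioned above; the same aggregation could be keyed on model name to produce a leaderboard-style summary.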
