Is prm800k open source?

Yes — openai/prm800k is open source, released under the MIT license.

What language is prm800k written in?

openai/prm800k is primarily written in Python.

How popular is prm800k?

openai/prm800k has 2.1k stars on GitHub.

Where can I find prm800k?

openai/prm800k is on GitHub at https://github.com/openai/prm800k.

← all repositories

openai/prm800k

800,000 human verdicts on where math LLMs lose the plot

A dataset that labels every step of model-generated math solutions so verifiers can spot exactly where reasoning derails, not just whether the final answer is wrong.

★2.1k stars Python Data Tooling LLMOps · Eval

View on GitHub ↗

Not currently ranked — collecting fresh signals.

star history

What it does PRM800K is the data release behind OpenAI’s Let’s Verify Step by Step paper. It packs 800,000 human correctness labels applied to individual steps of LLM-generated solutions for the MATH dataset, plus the labeler instructions, a nonstandard train/test split, and a Python grader that leans on sympy to check symbolic answer equivalence.

The interesting bit Instead of only scoring final answers, the dataset captures where a model first goes wrong—sometimes recording multiple alternative completions at a single step so verifiers can learn to prune bad reasoning mid-trajectory. The authors also shuffled 4,500 original MATH test problems into training, leaving a mere 500 problems for actual evaluation.

Key highlights

Per-step ratings (-1, 0, +1) across two collection phases, with quality-control questions and generation metadata
Phase 1 includes alternative step completions; phase 2 adds them only after the first detected mistake
Conservative answer grader built on Hendrycks’ normalization logic and sympy, openly admitting it may reject correct answers
Pre-generated solutions scored by the project’s PRM, plus large-scale sampled outputs for benchmarking ORM against PRM
Raw labeler instruction documents from both phases

Caveats

Data is stored via Git LFS, so the JSONL files aren’t present in a plain repository clone
The nonstandard MATH split holds out only 500 test problems, which limits comparability with results using the official split
The bundled grader is explicitly conservative and may occasionally mark correct answers wrong

Verdict Essential if you’re building process reward models or studying where chain-of-thought reasoning fails in math. Skip it if you want plug-and-play model weights—this is raw research data, evaluation scripts, and grading logic, not a finished verifier.

Frequently asked

What is openai/prm800k?: A dataset that labels every step of model-generated math solutions so verifiers can spot exactly where reasoning derails, not just whether the final answer is wrong.
Is prm800k open source?: Yes — openai/prm800k is open source, released under the MIT license.
What language is prm800k written in?: openai/prm800k is primarily written in Python.
How popular is prm800k?: openai/prm800k has 2.1k stars on GitHub.
Where can I find prm800k?: openai/prm800k is on GitHub at https://github.com/openai/prm800k.