openai/gpt-2-output-dataset

OpenAI's own AI detection homework, published

A release of GPT-2 outputs meant to help researchers detect machine-generated text: part research dataset, part admission that this problem needs outside help.

What it does

This repository distributes millions of GPT-2-generated text samples alongside real documents from WebText, the corpus GPT-2 was trained on. It covers every model size (117M to 1.5B parameters), sampled both at random without truncation and with Top-K 40 truncation, plus a finetuned variant that generates Amazon reviews. OpenAI wants researchers to study detection, biases, and whatever else the data reveals.
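To make the two generation strategies concrete, here is a minimal sketch of the difference between unrestricted random sampling and Top-K truncation. It is not taken from the repository: `top_k_probs` and `sample_token` are hypothetical helper names, and the logits are toy values standing in for a model's next-token scores.

```python
import math
import random


def top_k_probs(logits, k=None):
    """Turn logits into sampling probabilities.

    With k=None every token stays eligible (plain random sampling).
    With k set, only the k highest-scoring tokens keep non-zero probability.
    Tokens tied with the k-th score are also kept.
    """
    if k is not None:
        k = min(k, len(logits))
        cutoff = sorted(logits, reverse=True)[k - 1]
        logits = [x if x >= cutoff else float("-inf") for x in logits]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits, k=None, rng=random):
    """Draw one token index from the (optionally truncated) distribution."""
    probs = top_k_probs(logits, k)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]          # illustrative values only
unrestricted = top_k_probs(toy_logits)       # all four tokens can be drawn
truncated = top_k_probs(toy_logits, k=2)     # only the top two can be drawn
```

With Top-K 40, the same idea applies with `k=40` over the full vocabulary: the low-probability tail is cut off before sampling, which is one plausible reason the two kinds of output differ in how easy they are to detect.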

The interesting bit

The dataset doubles as a benchmark, with baseline detection results already reported: accuracy in the mid-90% range for Top-K 40 outputs, but only the mid-70s to high-80s for unrestricted random sampling. OpenAI also notes, almost in passing, that finetuning lets adversaries evade detection, which reads less like a feature and more like a warning about the arms race it was already anticipating in 2019.

Key highlights

  • 250K real WebText documents plus 250K generated samples per model, per generation strategy
  • Train/valid/test splits provided; includes a finetuned Amazon review model for adversarial detection research
  • Baseline detection code and analysis included (baseline.py, detection.md)
  • Data migrated from Google Cloud to Azure blob storage; download_dataset.py script provided
  • Direct data removal contact for WebText contributors (webtextdata@openai.com)
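For a sense of how the real and generated splits might feed a detection experiment, here is a rough sketch. It assumes each split is a JSON Lines file with a `text` field, which this summary does not confirm, and the function names and the toy `detector` are hypothetical; the repository's own baseline.py is the reference for how the published numbers were produced.

```python
import json


def read_texts(lines, field="text"):
    """Parse JSON Lines records and return one field from each.

    Assumption: one JSON object per line, with the document under `field`.
    """
    return [json.loads(line)[field] for line in lines if line.strip()]


def label_examples(real_texts, generated_texts):
    """Pair each text with a label: 0 for real, 1 for generated."""
    return [(t, 0) for t in real_texts] + [(t, 1) for t in generated_texts]


def accuracy(detector, examples):
    """Fraction of examples where detector(text) matches the label."""
    correct = sum(1 for text, label in examples if detector(text) == label)
    return correct / len(examples)


# Toy in-memory stand-ins for a real split and a generated split.
real_lines = ['{"text": "A report on local elections."}']
fake_lines = ['{"text": "the the the model model said"}']

examples = label_examples(read_texts(real_lines), read_texts(fake_lines))


def detector(text):
    """Placeholder rule (flag texts with repeated words), not a real method."""
    words = text.lower().split()
    return 1 if len(set(words)) < len(words) else 0


score = accuracy(detector, examples)
```

In practice the detector would be fit on the train split and scored on the test split, and `read_texts` would be given an open file handle rather than a list of strings.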

Caveats

  • The README is sparse on methodology details; how exactly the “initial analysis” was conducted is left to the linked files
  • No explicit license mentioned in the provided source
  • Storage URLs have changed once already; links may rot

Verdict

Researchers building or evaluating AI text detectors should start here: it pairs real WebText documents with labeled outputs from a foundational model. Everyone else can skip it; this is raw data and a few scripts, not a tool you run out of the box.
