openai/gpt-2-output-dataset

OpenAI's own AI detection homework, published

A release of GPT-2 outputs designed to make the model detectable—part research dataset, part admission that this problem needs outside help.

★ 2k stars · Python · Data Tooling · Language Models · LLMOps · Eval

What it does

This repository distributes millions of GPT-2-generated text samples alongside real documents from WebText, the corpus GPT-2 was trained on. It includes outputs from every model size (117M to 1.5B parameters), each sampled two ways: unrestricted random sampling and Top-K 40 truncation. A finetuned variant that spits out Amazon reviews is included as well. OpenAI wants researchers to study detection, biases, and whatever else the data reveals.
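
To make the scale concrete, the grid of generated sets can be enumerated as in the rough sketch below. The four size labels are the commonly cited GPT-2 parameter counts, the strategy names are placeholders rather than the repository's own identifiers, and the 250K figure is the per-configuration count listed under Key highlights. The finetuned Amazon review variant is left out.

```python
from itertools import product

# Assumed labels: the four published GPT-2 sizes and placeholder strategy names.
MODEL_SIZES = ["117M", "345M", "762M", "1542M"]
STRATEGIES = ["random", "top-k-40"]
SAMPLES_PER_CONFIG = 250_000  # per model, per generation strategy


def generated_configs():
    """Return every (model size, strategy) pair in the assumed grid."""
    return list(product(MODEL_SIZES, STRATEGIES))


configs = generated_configs()
total_generated = len(configs) * SAMPLES_PER_CONFIG

for size, strategy in configs:
    print(f"{size:>6}  {strategy}")
print(f"{len(configs)} configurations, {total_generated:,} generated samples")
```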

The interesting bit

The dataset doubles as a benchmark with baseline detection scores already baked in: accuracy in the mid-90s (percent) for Top-K 40 outputs, but only mid-70s to high-80s for unrestricted random sampling. OpenAI also notes, almost in passing, that finetuning lets adversaries evade detection. That reads less like a feature and more like a warning about the arms race they were already anticipating in 2019.
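
Detection accuracy here means the fraction of held-out samples a classifier correctly labels as human or generated. The sketch below shows that measurement with a deliberately tiny unigram naive Bayes classifier on made-up strings. It is not the repository's baseline.py, and the toy texts, function names, and labels are all illustrative assumptions.

```python
import math
from collections import Counter


def train(samples):
    """Count words per label from (text, label) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in samples:
        word_counts.setdefault(label, Counter()).update(text.lower().split())
        doc_counts[label] += 1
    return word_counts, doc_counts


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, doc_counts = model
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        n_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(predict(model, text) == label for text, label in samples)
    return correct / len(samples)


# Toy data, invented for illustration only.
train_set = [
    ("the council met on tuesday to vote", "human"),
    ("rain delayed the match until evening", "human"),
    ("the unicorn spoke perfect english in the valley", "generated"),
    ("scientists discovered a unicorn herd in a remote valley", "generated"),
]
test_set = [
    ("the council vote was delayed by rain", "human"),
    ("unicorn valley scientists spoke english", "generated"),
]

model = train(train_set)
print(f"held-out accuracy: {accuracy(model, test_set):.2f}")
```

A score on four training sentences says nothing about real detection difficulty; the point is only the shape of the evaluation: fit on one split, then score on a split the classifier has not seen.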

Key highlights

250K real WebText documents plus 250K generated samples per model, per generation strategy
Train/valid/test splits provided; includes a finetuned Amazon review model for adversarial detection research
Baseline detection code and analysis included (baseline.py, detection.md)
Data migrated from Google Cloud to Azure blob storage; download_dataset.py script provided
Direct data removal contact for WebText contributors (webtextdata@openai.com)
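
Once the files are downloaded, a detection experiment needs the human and generated texts paired with labels. The sketch below assumes each file holds one JSON object per line with a `text` field; that layout and the helper names are assumptions for illustration and should be checked against the actual files. In-memory strings stand in for the downloaded data.

```python
import io
import json


def read_texts(lines, text_field="text"):
    """Yield the text of each non-empty JSON line (field name is an assumption)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)[text_field]


def build_labeled(human_lines, generated_lines):
    """Pair human texts with label 0 and generated texts with label 1."""
    labeled = [(text, 0) for text in read_texts(human_lines)]
    labeled += [(text, 1) for text in read_texts(generated_lines)]
    return labeled


# Stand-ins for two downloaded files (hypothetical contents).
human_file = io.StringIO('{"text": "A real article."}\n{"text": "Another one."}\n')
generated_file = io.StringIO('{"text": "A sampled continuation."}\n\n')

pairs = build_labeled(human_file, generated_file)
print(pairs)
```

With real data, the same functions accept file handles opened with `open(path, encoding="utf-8")`, one call per split, so the train, valid, and test sets stay separate.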

Caveats

The README is sparse on methodology details—how exactly the “initial analysis” was conducted is left to the linked files
The README itself does not mention a license, although the repository is listed as MIT-licensed
Storage URLs have changed once already; links may rot

Verdict

Researchers building or evaluating AI text detectors should start here—it’s the ground truth for a foundational model. Everyone else can skip; this is raw data and a few scripts, not a tool you run out of the box.

Frequently asked

What is openai/gpt-2-output-dataset? A release of GPT-2 outputs designed to make the model detectable: part research dataset, part admission that this problem needs outside help.
Is gpt-2-output-dataset open source? Yes. openai/gpt-2-output-dataset is open source, released under the MIT license.
What language is gpt-2-output-dataset written in? openai/gpt-2-output-dataset is primarily written in Python.
How popular is gpt-2-output-dataset? openai/gpt-2-output-dataset has 2k stars on GitHub.
Where can I find gpt-2-output-dataset? openai/gpt-2-output-dataset is on GitHub at https://github.com/openai/gpt-2-output-dataset.