A shopping list for LLM fine-tuners who hate data archaeology
A curated, license-checked inventory of post-training datasets for math, code, science, and instruction following.

What it does
This repository is a hand-maintained catalog of datasets used for supervised fine-tuning (SFT) and other post-training stages. It organizes them by domain—general, math, science, code, instruction following—and tags each entry with size, whether it includes reasoning traces, and licensing notes. The author also includes a short rubric for what makes a dataset “good”: accuracy, diversity, and complexity.
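To make the tagging scheme concrete, here is a minimal sketch of how the catalog's per-entry fields could be transcribed into a small data structure and filtered. The repository is a README, not structured data, so the `CatalogEntry` class, the `shortlist` helper, and every entry below are hypothetical placeholders rather than rows copied from the catalog.

```python
from dataclasses import dataclass

# Licenses the catalog treats as permissive defaults (Apache 2.0, MIT, CC-BY).
# The exact identifier strings are an assumption made for this sketch.
PERMISSIVE = {"Apache-2.0", "MIT", "CC-BY-4.0"}


@dataclass(frozen=True)
class CatalogEntry:
    """One hand-transcribed catalog row (hypothetical schema)."""
    name: str
    domain: str                 # e.g. "general", "math", "science", "code"
    num_samples: int
    has_reasoning_traces: bool
    license: str


def shortlist(entries, domain, require_reasoning=False):
    """Return permissively licensed entries for a domain, largest first."""
    matches = (
        e for e in entries
        if e.domain == domain
        and e.license in PERMISSIVE
        and (e.has_reasoning_traces or not require_reasoning)
    )
    return sorted(matches, key=lambda e: e.num_samples, reverse=True)


# Placeholder entries for illustration only; none of these are real datasets.
entries = [
    CatalogEntry("example-math-a", "math", 120_000, True, "Apache-2.0"),
    CatalogEntry("example-math-b", "math", 900_000, False, "MIT"),
    CatalogEntry("example-math-c", "math", 50_000, True, "CC-BY-NC-4.0"),
    CatalogEntry("example-code-a", "code", 30_000, True, "MIT"),
]

for entry in shortlist(entries, "math", require_reasoning=True):
    print(entry.name, entry.num_samples, entry.license)
```

Keeping the license check inside the filter mirrors the catalog's own habit of flagging non-permissive entries up front, so a restricted dataset cannot slip into a mixture by accident.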
The interesting bit
The curation is opinionated in a useful way. Entries note whether a dataset is synthetic, distilled from which teacher model, and whether responses are verified (e.g., math with solvers, code with unit tests). The author flags non-permissive licenses explicitly—rare diligence in a space where “open” often means “check the fine print.”
Key highlights
- Covers specialized domains often underserved: formal math proofs (Lean 4), competitive programming, SWE trajectories, GPQA-style science questions
- Tracks the “reasoning” trend: tags datasets with/without thinking traces, reflecting the post-DeepSeek shift in training recipes
- Includes scale context: dataset sizes range from ~29K (Codeforces) to 15.87M (Nemotron-Cascade-2)
- Permissive-license bias: defaults to Apache 2.0/MIT/CC-BY unless noted otherwise
- Links to reproductions and ablations (e.g., open-perfectblend recreating a published mixture)
Caveats
- No code or tools for actually mixing or filtering these datasets—this is purely a reference list
- Some entries are truncated or incomplete in the README (e.g., tulu-3-sft-personas cuts mid-citation)
- Date labels on datasets appear to include future dates (“Mar 2026”), suggesting either planned releases or a quirk in how entries are dated
Verdict
Worth bookmarking if you’re building training mixtures and tired of rediscovering the same five Hugging Face datasets. Not a substitute for actually inspecting data quality yourself—no curation replaces a histogram and a suspicious sample read.
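The "histogram and a suspicious sample read" can be as small as the sketch below. It uses only the standard library and toy in-memory records; the `prompt` and `response` field names, the helper names, and whitespace tokenization are all assumptions made for illustration, not anything the catalog prescribes.

```python
import random
from collections import Counter


def length_histogram(texts, bin_width=50):
    """Count texts per whitespace-token-length bin, keyed by the bin's lower edge."""
    counts = Counter((len(t.split()) // bin_width) * bin_width for t in texts)
    return dict(sorted(counts.items()))


def print_histogram(hist, bin_width=50, max_bar=40):
    """Render the histogram as rows of '#' scaled to the fullest bin."""
    peak = max(hist.values(), default=0)
    for low, n in hist.items():
        bar = "#" * max(1, round(max_bar * n / peak))
        print(f"{low:>6}-{low + bin_width - 1:<6} {n:>6} {bar}")


def sample_for_reading(records, k=5, seed=0):
    """Pick up to k records at random (seeded, so the read is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


# Toy records standing in for a real dataset; field names are hypothetical.
records = [
    {"prompt": "What is 2 + 2?", "response": "4"},
    {"prompt": "Reverse a string in Python.",
     "response": "Use slicing: s[::-1] returns the reversed string."},
    {"prompt": "Name a noble gas.", "response": ""},
]

hist = length_histogram([r["response"] for r in records], bin_width=5)
print_histogram(hist, bin_width=5)

for record in sample_for_reading(records, k=2):
    print(repr(record["prompt"]), "->", repr(record["response"]))
```

Empty responses land in the lowest bin, which is usually the first place to look: a spike there, or a long tail of very long responses, is a cue to read samples from that bin before trusting the dataset.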