mlabonne/llm-datasets

A shopping list for LLM fine-tuners who hate data archaeology

A curated, license-checked inventory of post-training datasets for math, code, science, and instruction following.

4.6k stars · Data Tooling · Learning · Velocity (7d): +6.0 ★/day · Trend: steady

What it does

This repository is a hand-maintained catalog of datasets used for supervised fine-tuning (SFT) and other post-training stages. It organizes them by domain (general, math, science, code, instruction following) and tags each entry with its size, whether it includes reasoning traces, and licensing notes. The author also includes a short rubric for what makes a dataset “good”: accuracy, diversity, and complexity.
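To make the tagging scheme concrete, here is a minimal sketch of how the catalog's per-entry tags could be used as a filter. The repository itself is a README, not a machine-readable index, so the `CatalogEntry` fields, the `select` helper, and every entry name and number below are hypothetical placeholders rather than real catalog data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    # Hypothetical record mirroring the tags the README attaches to each dataset.
    name: str
    domain: str
    num_samples: int
    has_reasoning_traces: bool
    license: str


# Assumed set of "permissive" license identifiers for this sketch.
PERMISSIVE = {"apache-2.0", "mit", "cc-by-4.0"}


def select(entries, domain, require_traces, permissive_only=True):
    """Return entries matching a domain, a reasoning-trace flag, and a license policy."""
    return [
        e
        for e in entries
        if e.domain == domain
        and e.has_reasoning_traces == require_traces
        and (not permissive_only or e.license.lower() in PERMISSIVE)
    ]


# Placeholder entries, not taken from the repository.
entries = [
    CatalogEntry("placeholder-math-a", "math", 100_000, True, "Apache-2.0"),
    CatalogEntry("placeholder-math-b", "math", 50_000, False, "MIT"),
    CatalogEntry("placeholder-math-c", "math", 250_000, True, "CC-BY-NC-4.0"),
    CatalogEntry("placeholder-code-a", "code", 30_000, True, "MIT"),
]

picked = select(entries, domain="math", require_traces=True)
print([e.name for e in picked])
```

In practice you would transcribe the handful of entries you care about by hand; the point is only that size, trace, and license tags are enough to narrow a shortlist quickly.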

The interesting bit

The curation is opinionated in a useful way. Entries note whether a dataset is synthetic, which teacher model it was distilled from, and whether responses are verified (e.g., math checked with solvers, code checked with unit tests). The author also flags non-permissive licenses explicitly, a rare bit of diligence in a space where “open” often means “check the fine print.”

Key highlights

  • Covers specialized domains that are often underserved: formal math proofs (Lean 4), competitive programming, SWE trajectories, GPQA-style science questions
  • Tracks the “reasoning” trend: tags whether each dataset includes thinking traces, reflecting the post-DeepSeek shift in training recipes
  • Includes scale context: dataset sizes range from ~29K (Codeforces) to 15.87M (Nemotron-Cascade-2)
  • Permissive-license bias: defaults to Apache 2.0/MIT/CC-BY unless noted otherwise
  • Links to reproductions and ablations (e.g., open-perfectblend recreating a published mixture)

Caveats

  • No code or tools for actually mixing or filtering these datasets; this is purely a reference list
  • Some entries are truncated or incomplete in the README (e.g., tulu-3-sft-personas cuts mid-citation)
  • Date labels on datasets appear to include future dates (“Mar 2026”), suggesting either planned releases or a quirk in how entries are dated
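Since the repository ships no mixing tooling, here is a rough sketch of what the missing step might look like: building a fixed-size mixture from several sources by ratio. The `mix_by_ratio` function, the source names, and the 3:1 ratio are all assumptions for illustration, not anything the catalog prescribes.

```python
import random


def mix_by_ratio(sources, ratios, total, seed=0):
    """Draw roughly `total` rows across sources in proportion to `ratios`.

    Rows are sampled without replacement; a source smaller than its quota
    contributes everything it has, so the result can be shorter than `total`.
    Rounding of quotas can also shift the final size by a row or two.
    """
    rng = random.Random(seed)
    weight_sum = sum(ratios.values())
    mixture = []
    for name, rows in sources.items():
        quota = round(total * ratios[name] / weight_sum)
        quota = min(quota, len(rows))  # never oversample a small source
        mixture.extend((name, row) for row in rng.sample(rows, quota))
    rng.shuffle(mixture)  # interleave sources instead of leaving them in blocks
    return mixture


# Toy stand-ins for loaded datasets (hypothetical content).
sources = {
    "math": [f"math example {i}" for i in range(100)],
    "code": [f"code example {i}" for i in range(100)],
}
mixture = mix_by_ratio(sources, ratios={"math": 3, "code": 1}, total=40)
print(len(mixture), mixture[:3])
```

Sampling without replacement keeps duplicates out of the mixture, at the cost of capping each source at its own size; upsampling a small source would need a different policy.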

Verdict

Worth bookmarking if you’re building training mixtures and tired of rediscovering the same five Hugging Face datasets. Not a substitute for actually inspecting data quality yourself: no curation replaces a histogram and a skeptical read through a sample.
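For what that inspection might look like at its most basic, here is a small sketch: a character-length histogram over responses plus a seeded random sample to read by hand. The helper names, the bucket size, and the toy `responses` list are assumptions; real checks would run on whatever dataset you actually loaded.

```python
import random
from collections import Counter


def length_histogram(texts, bucket_size=100):
    """Count texts per character-length bucket, keyed by the bucket's lower bound."""
    return Counter((len(t) // bucket_size) * bucket_size for t in texts)


def sample_for_review(texts, k=5, seed=0):
    """Pick up to k texts at random (seeded) for manual reading."""
    rng = random.Random(seed)
    return rng.sample(texts, min(k, len(texts)))


# Toy responses; the empty string stands in for the kind of row worth catching.
responses = ["ok", "x" * 150, "y" * 180, "", "z" * 420]

bucket_size = 100
hist = length_histogram(responses, bucket_size)
for lower in sorted(hist):
    upper = lower + bucket_size - 1
    print(f"{lower:>5}-{upper:<5} {'#' * hist[lower]}")

for text in sample_for_review(responses, k=3):
    print(repr(text[:80]))
```

Empty or suspiciously short responses tend to show up in the lowest bucket, which is usually the first place worth reading.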
