Is llm-datasets open source?

Yes — mlabonne/llm-datasets is an open-source project tracked on heatdrop.

How popular is llm-datasets?

mlabonne/llm-datasets has 4.7k stars on GitHub.

Where can I find llm-datasets?

mlabonne/llm-datasets is on GitHub at https://github.com/mlabonne/llm-datasets.

mlabonne/llm-datasets

A shopping list for LLM fine-tuners who hate data archaeology

A curated, license-checked inventory of post-training datasets for math, code, science, and instruction following.

★4.7k stars Data Tooling Learning

What it does

This repository is a hand-maintained catalog of datasets used for supervised fine-tuning (SFT) and other post-training stages. It organizes them by domain—general, math, science, code, instruction following—and tags each entry with size, whether it includes reasoning traces, and licensing notes. The author also includes a short rubric for what makes a dataset “good”: accuracy, diversity, and complexity.
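The tagging scheme described above maps naturally onto a small record type. The following is a minimal sketch of how a reader might transcribe entries and filter them. The repository itself is a README, not a data file, so the field names, the `PERMISSIVE` set, the `shortlist` helper, and the example entries are all hypothetical.

```python
from dataclasses import dataclass


# Hypothetical record layout: the field names are illustrative and are not
# taken from the repository, which publishes a README rather than a data file.
@dataclass(frozen=True)
class CatalogEntry:
    name: str
    domain: str          # e.g. "math", "code", "science"
    num_samples: int
    has_reasoning: bool  # True if responses include thinking traces
    license: str


# Assumed set of license identifiers to treat as permissive.
PERMISSIVE = {"apache-2.0", "mit", "cc-by-4.0"}


def shortlist(entries, domain, require_reasoning=False):
    """Return permissively licensed entries in one domain, largest first."""
    picked = [
        e for e in entries
        if e.domain == domain
        and e.license.lower() in PERMISSIVE
        and (e.has_reasoning or not require_reasoning)
    ]
    return sorted(picked, key=lambda e: e.num_samples, reverse=True)


# Made-up entries for illustration only; none of these are real datasets.
EXAMPLE_ENTRIES = [
    CatalogEntry("toy-math-a", "math", 120_000, True, "Apache-2.0"),
    CatalogEntry("toy-math-b", "math", 900_000, False, "MIT"),
    CatalogEntry("toy-math-c", "math", 2_000_000, True, "CC-BY-NC-4.0"),
    CatalogEntry("toy-code-a", "code", 50_000, True, "MIT"),
]

if __name__ == "__main__":
    for entry in shortlist(EXAMPLE_ENTRIES, "math", require_reasoning=True):
        print(entry.name, entry.num_samples, entry.license)
```

Keeping the license as a plain string and normalizing it at filter time means a non-permissive entry is excluded by default rather than by remembering to flag it.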

The interesting bit

The curation is opinionated in a useful way. Entries note whether a dataset is synthetic, which teacher model it was distilled from, and whether responses are verified (e.g., math checked with solvers, code checked with unit tests). The author also flags non-permissive licenses explicitly, which is rare diligence in a space where “open” often means “check the fine print.”

Key highlights

Covers specialized domains that are often underserved: formal math proofs (Lean 4), competitive programming, SWE trajectories, GPQA-style science questions
Tracks the “reasoning” trend: tags each dataset by whether it includes thinking traces, reflecting the post-DeepSeek shift in training recipes
Includes scale context: dataset sizes range from ~29K (Codeforces) to 15.87M (Nemotron-Cascade-2)
Permissive-license bias: defaults to Apache 2.0/MIT/CC-BY unless noted otherwise
Links to reproductions and ablations (e.g., open-perfectblend recreating a published mixture)

Caveats

No code or tools for actually mixing or filtering these datasets—this is purely a reference list
Some entries are truncated or incomplete in the README (e.g., tulu-3-sft-personas cuts mid-citation)
Date labels on datasets appear to include future dates (“Mar 2026”), suggesting either planned releases or a quirk in how entries are dated

Verdict

Worth bookmarking if you’re building training mixtures and tired of rediscovering the same five Hugging Face datasets. Not a substitute for actually inspecting data quality yourself—no curation replaces a histogram and a suspicious sample read.
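The "histogram and a suspicious sample read" can be done with the standard library alone. The following is a rough sketch, assuming each record is a dict with a `response` text field. The record layout, the function names, and the toy data are illustrative assumptions and not part of the repository.

```python
import random
from collections import Counter


def length_histogram(records, field="response", bucket=50):
    """Count records per word-count bucket (0-49, 50-99, ...)."""
    counts = Counter()
    for rec in records:
        n_words = len(rec[field].split())
        counts[(n_words // bucket) * bucket] += 1
    return dict(sorted(counts.items()))


def sample_for_reading(records, k=3, seed=0):
    """Pick up to k records at random, reproducibly, for manual inspection."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


# Made-up records for illustration only.
TOY_RECORDS = [
    {"prompt": "2+2?", "response": "4"},
    {"prompt": "Explain recursion.", "response": "word " * 60},
    {"prompt": "Say hi.", "response": "hi there"},
    {"prompt": "Empty?", "response": ""},
]

if __name__ == "__main__":
    print(length_histogram(TOY_RECORDS))
    for rec in sample_for_reading(TOY_RECORDS, k=2):
        print(repr(rec["prompt"]), "->", repr(rec["response"][:80]))
```

A spike in the lowest bucket is often the first hint of empty or truncated responses, which is where reading the sampled records by hand pays off.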

Frequently asked

What is mlabonne/llm-datasets?: A curated, license-checked inventory of post-training datasets for math, code, science, and instruction following.