Is deita open source?

Yes. The hkust-nlp/deita repository is open source, released under the Apache-2.0 license; the datasets, models, and scorer weights it publishes carry their own licenses.

What language is deita written in?

hkust-nlp/deita is primarily written in Python.

How popular is deita?

hkust-nlp/deita has 600 stars on GitHub.

Where can I find deita?

hkust-nlp/deita is on GitHub at https://github.com/hkust-nlp/deita.

hkust-nlp/deita

Why fine-tune on 200K samples when 6K will do?

Deita automatically selects tiny, high-quality instruction-tuning subsets so you can train alignment models on 6K–10K examples instead of hundreds of thousands.

★ 600 stars · Python · Language Models · Data Tooling · ML Frameworks

What it does

Deita provides automatic data-selection pipelines, pre-filtered datasets, and the chat models trained on them for LLM alignment. It ships trained scorer models that rate instruction-following examples by complexity and quality, then uses those scores to cut large data pools down to lightweight subsets, as small as 6K samples, for supervised fine-tuning (SFT) and DPO. The released chat models are built on Mistral-7B and LLaMA-13B bases.
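The score-then-select idea described above can be sketched in a few lines of Python. This is a minimal illustration, not Deita's actual pipeline or API: the `Example` type, the `select_top_k` helper, and the toy word-count scorers are all hypothetical stand-ins, and combining the two scores by multiplication is an assumption of this sketch. In practice the two callables would wrap the trained complexity and quality scorer models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Example:
    """One instruction-tuning sample (hypothetical container type)."""
    instruction: str
    response: str


# Hypothetical scorer signature: takes an example, returns a float score.
Scorer = Callable[[Example], float]


def select_top_k(
    pool: Sequence[Example],
    complexity: Scorer,
    quality: Scorer,
    k: int,
) -> List[Example]:
    """Return the k examples with the highest combined score.

    Assumption of this sketch: the combined score is complexity * quality.
    Ties are broken by original position so the result is deterministic.
    """
    scored = [
        (complexity(ex) * quality(ex), index, ex)
        for index, ex in enumerate(pool)
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [ex for _, _, ex in scored[:k]]


# Toy stand-ins for the trained scorers: word counts, for illustration only.
def toy_complexity(ex: Example) -> float:
    return float(len(ex.instruction.split()))


def toy_quality(ex: Example) -> float:
    return float(len(ex.response.split()))


pool = [
    Example("Say hi", "Hi"),
    Example("Explain how binary search works", "It halves the range each step"),
    Example("Sum two numbers", "Add them"),
]

subset = select_top_k(pool, toy_complexity, toy_quality, k=2)
for ex in subset:
    print(ex.instruction)
```

Keeping the scorers as plain callables means the same selection code works whether the scores come from a model, a cached file, or a heuristic, which is convenient when scoring a large pool is the expensive step.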

The interesting bit

The project bets that dataset curation matters more than dataset volume. Its Mistral-based DEITA-7B-v1.0 was trained on just 6K SFT plus 10K DPO examples, yet the README tables place it within striking distance of Zephyr-7B-β and Starling-7B, rivals trained on hundreds of thousands of examples. That is a lot of GPU hours saved.

Key highlights

Ships ready-made 6K and 10K SFT datasets, plus a 300K “SOTA pool” for custom selection.
Includes trained complexity and quality scorers to rank your own instruction data.
Released chat models span Mistral-7B and LLaMA-13B variants.
Pipelines are configurable and support vLLM for faster scoring inference.
Licensing is split: datasets are MIT, models are Apache 2.0 or LLaMA 2, and scorer weights are under the LLaMA license.

Verdict

Worth a look if you are building alignment pipelines and suspect most of your instruction dataset is noise. Look elsewhere if you need a general pre-training or RL framework; this is strictly about curating and tuning instruction data.
