
hkust-nlp/deita

Deita is an open-source toolkit for data-efficient instruction tuning that selects high-quality training data for large language model alignment.

Velocity (7d): +0.6 ★ / day
Trend: steady

The project provides automated data selection methods for instruction tuning of LLMs, aiming to identify the training samples that maximize model performance with minimal data. It includes scorer models (a complexity scorer and a quality scorer) and complete pipelines that can select a high-quality 6k- or 10k-sample subset from a larger dataset. The released datasets have been used to train models such as Zephyr Gemma.
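To make the idea of score-based selection concrete, here is a minimal sketch in plain Python. It is not deita's API. The `ScoredSample` type, the `select_top_k` helper, the example scores, and the choice to combine the two scores by multiplying them are all assumptions made for illustration. The sketch only shows how per-sample complexity and quality scores could be used to keep a fixed-size subset of a larger pool.

```python
from dataclasses import dataclass


@dataclass
class ScoredSample:
    """Hypothetical container: one training sample plus its two scores."""
    text: str
    complexity: float  # assumed to come from a complexity scorer
    quality: float     # assumed to come from a quality scorer


def select_top_k(samples: list[ScoredSample], k: int) -> list[ScoredSample]:
    """Return the k samples with the highest combined score.

    Assumption: the combined score is complexity * quality.
    """
    ranked = sorted(samples, key=lambda s: s.complexity * s.quality, reverse=True)
    return ranked[:k]


# Toy pool with made-up scores, for illustration only.
pool = [
    ScoredSample("Explain quicksort and analyze its worst case.", 4.5, 4.0),
    ScoredSample("Say hi.", 1.0, 2.0),
    ScoredSample("Summarize this article in three bullet points.", 3.0, 4.5),
]

subset = select_top_k(pool, k=2)
for sample in subset:
    print(sample.text)
```

In a real pipeline the scores would come from the scorer models, and `k` would be on the order of 6,000 or 10,000 rather than 2.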
