hkust-nlp/deita
Deita is an open-source toolkit for data-efficient instruction tuning that selects high-quality training data for large language model alignment.

Velocity (7d): +0.6 ★ / day
Trend: steady
The project provides automated data selection methods for instruction tuning of LLMs, aiming to identify the training samples that maximize model performance with minimal data. It includes two scorer models (a complexity scorer and a quality scorer) and complete pipelines that can select a high-quality subset of 6k or 10k samples from a larger dataset. The released datasets have been used to train models such as Zephyr Gemma.
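
The selection idea described above can be illustrated with a minimal sketch. It does not use the Deita API. It assumes each sample already carries a complexity score and a quality score, and it ranks samples by the product of the two, which is an assumption made for illustration only. The `Sample` class, the `select_top_k` function, and the example scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    complexity: float  # hypothetical precomputed complexity score
    quality: float     # hypothetical precomputed quality score


def select_top_k(samples, k):
    """Return up to k samples with the highest combined score."""
    # Combining the two scores by multiplication is an illustrative
    # assumption, not a statement of how the toolkit ranks data.
    ranked = sorted(
        samples,
        key=lambda s: s.complexity * s.quality,
        reverse=True,
    )
    return ranked[:k]


# Toy pool with made-up scores; a real pool would be far larger,
# with k set to something like 6000 or 10000.
pool = [
    Sample("a", 2.0, 3.0),
    Sample("b", 5.0, 4.0),
    Sample("c", 1.0, 1.0),
    Sample("d", 4.0, 4.5),
]

subset = select_top_k(pool, k=2)
print([s.text for s in subset])
```

A plain top-k ranking keeps the sketch short. A fuller pipeline might also filter out near-duplicate samples before filling the budget.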