princeton-nlp/LESS
A data selection framework that identifies influential training examples to improve specific LLM capabilities through targeted instruction tuning.

LESS selects the most influential training data for LLM instruction tuning: it builds a gradient datastore over candidate examples and scores each example by its estimated influence on a target capability. The pipeline has three stages (warmup training, gradient collection, and influence-based selection) and is applied to datasets such as Flan v2, CoT, Dolly, and Open Assistant. The selected subset is then used to fine-tune models such as Llama and Mistral toward that capability.
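
The selection step can be illustrated with a minimal sketch. It is not the repository's code: it assumes the gradient datastore is a plain list of per-example gradient feature vectors, uses mean cosine similarity to the target-task gradients as a stand-in influence score, and the names `influence_scores` and `select_top_fraction` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def influence_scores(train_grads, target_grads):
    """Score each training example by its mean similarity to the target gradients.

    Assumption: this averaging rule is chosen for illustration only.
    """
    return [
        sum(cosine(g, t) for t in target_grads) / len(target_grads)
        for g in train_grads
    ]


def select_top_fraction(scores, fraction):
    """Return indices of the highest-scoring examples, keeping at least one."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy datastore: four training examples with 2-dimensional gradient features.
train_grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
# One gradient feature vector computed from a target-capability example.
target_grads = [[1.0, 0.0]]

scores = influence_scores(train_grads, target_grads)
selected = select_top_fraction(scores, fraction=0.5)
print(selected)
```

In this toy case the examples whose gradients point in the same direction as the target gradient are kept, and the one pointing in the opposite direction is ranked last. Real gradient features are far higher-dimensional, so a practical version would use batched tensor operations rather than Python lists.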