magpie-align/magpie
Research-grade pipeline for generating high-quality synthetic alignment data by prompting aligned LLMs with their native pre-query templates.

Magpie generates training data for LLM alignment and fine-tuning by exploiting the prompt templates of aligned LLMs to produce both user queries and model responses without manual annotation. It requires no seed questions or prompt engineering, making synthetic data generation more scalable. The project provides generated datasets (1M+ examples) from models like Llama-3.3, QwQ, and Skywork-o1, specifically targeting supervised fine-tuning workflows.