meta-llama/synthetic-data-kit
A CLI tool for generating synthetic training datasets for fine-tuning LLMs using modular ingestion, creation, curation, and format-conversion workflows.

Synthetic Data Kit is a Python-based tool maintained by Meta for creating high-quality synthetic datasets tailored for LLM fine-tuning. It provides a modular four-command CLI workflow that uses LLMs (vLLM or external APIs) to generate reasoning traces, QA pairs, and other training examples. The toolkit includes ingestion for various file formats, example creation with chain-of-thought support, curation using Llama-as-judge for quality filtering, and format conversion to match downstream fine-tuning requirements.