← all repositories

meta-llama/synthetic-data-kit

A CLI tool for generating synthetic training datasets for fine-tuning LLMs using modular ingestion, creation, curation, and format-conversion workflows.

synthetic-data-kit
Velocity · 7d
+3.6
★ / day
Trend
steady
star history

Synthetic Data Kit is a Python-based tool maintained by Meta for creating high-quality synthetic datasets tailored for LLM fine-tuning. It provides a modular four-command CLI workflow that uses LLMs (vLLM or external APIs) to generate reasoning traces, QA pairs, and other training examples. The toolkit includes ingestion for various file formats, example creation with chain-of-thought support, curation using Llama-as-judge for quality filtering, and format conversion to match downstream fine-tuning requirements.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.