InternScience/GraphGen
Synthetic data generation framework that creates knowledge-graph-based training data to improve supervised fine-tuning of LLMs.

GraphGen is a data synthesis framework that improves LLM fine-tuning by generating training data from knowledge graphs. Its knowledge-driven pipelines produce question-answer pairs and other supervised fine-tuning (SFT) data. Samples are generated through structured graph traversal and question generation, which yields diverse synthetic data. The framework integrates with training frameworks such as LLaMA-Factory and XTuner and supports models including Qwen and LLaMA, so the generated samples can be applied directly to fine-tuning.
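The idea of generating question-answer pairs through graph traversal can be illustrated with a minimal sketch. This is not GraphGen's API or its actual pipeline: the toy triples, the function names (`build_adjacency`, `one_hop_samples`, `two_hop_samples`, `generate_samples`), the question templates, and the Alpaca-style `instruction`/`input`/`output` record layout are all assumptions made for illustration.

```python
import json
from collections import defaultdict

# Hypothetical toy knowledge graph as (head, relation, tail) triples.
TRIPLES = [
    ("Marie Curie", "born in", "Warsaw"),
    ("Warsaw", "capital of", "Poland"),
    ("Marie Curie", "field", "physics"),
]


def build_adjacency(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    adjacency = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))
    # Return a plain dict so later lookups cannot insert new keys.
    return dict(adjacency)


def one_hop_samples(triples):
    """Emit one question per edge, answered by the edge's tail entity."""
    for head, relation, tail in triples:
        yield {
            "instruction": f"For the entity '{head}', what is the value of the relation '{relation}'?",
            "input": "",
            "output": tail,
        }


def two_hop_samples(adjacency):
    """Emit one question per two-edge path, answered by the final entity."""
    for head, edges in adjacency.items():
        for first_relation, middle in edges:
            for second_relation, tail in adjacency.get(middle, []):
                yield {
                    "instruction": (
                        f"Start at '{head}', follow '{first_relation}', "
                        f"then follow '{second_relation}'. Which entity do you reach?"
                    ),
                    "input": "",
                    "output": tail,
                }


def generate_samples(triples):
    """Collect one-hop samples first, then two-hop samples."""
    adjacency = build_adjacency(triples)
    return list(one_hop_samples(triples)) + list(two_hop_samples(adjacency))


if __name__ == "__main__":
    # Print one JSON record per line (JSON Lines).
    for sample in generate_samples(TRIPLES):
        print(json.dumps(sample, ensure_ascii=False))
```

A real pipeline would typically use an LLM to phrase questions and answers rather than fixed templates, and the record fields would need to match whatever dataset schema the chosen training framework expects.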