Is GraphGen open source?

Yes — InternScience/GraphGen is open source, released under the Apache-2.0 license.

What language is GraphGen written in?

InternScience/GraphGen is primarily written in Python.

How popular is GraphGen?

InternScience/GraphGen has 1.1k stars on GitHub.

Where can I find GraphGen?

InternScience/GraphGen is on GitHub at https://github.com/InternScience/GraphGen.

InternScience/GraphGen

Map what your LLM doesn't know, then teach it

GraphGen constructs fine-grained knowledge graphs from source text to generate synthetic QA pairs that target the specific concepts and relationships your model is most likely to get wrong.

★ 1.1k stars · Python · Data Tooling · Language Models · ML Frameworks

What it does

GraphGen reads source text and builds a fine-grained knowledge graph, then uses expected calibration error to find the facts and relationships an LLM is least likely to know. It generates synthetic QA pairs (atomic, multi-hop, chain-of-thought, and even visual questions) that hammer on those long-tail weaknesses instead of rehashing common knowledge. The resulting datasets plug into standard fine-tuning tools such as LLaMA-Factory and xtuner.
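To make the calibration idea concrete, here is a minimal sketch of the standard binned expected calibration error. It is the textbook definition, not GraphGen's own implementation; the function name, the bin count, and the toy inputs are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the sample-weighted mean of |accuracy - confidence| per bin.

    confidences: model confidence in [0, 1] for each answer.
    correct:     whether each answer was actually right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_confidence)
    return ece


# Toy data (hypothetical): the model claims 90% confidence but is right 3 times out of 4.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(f"ECE on toy data: {toy_ece:.2f}")
```

A large gap between stated confidence and measured accuracy on statements drawn from one part of the graph is the kind of signal that could mark that part as poorly known.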

The interesting bit

The framework treats a model’s ignorance as a measurable resource: it maps knowledge gaps onto the graph, samples multi-hop neighborhoods to surface hidden connections, and controls question style so the same fact can be probed in many different ways. For pretraining, it can also reformulate existing corpora into diverse variants without adding any new raw data, lifting downstream benchmark averages by roughly a point.
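Sampling a multi-hop neighborhood can be pictured as a bounded breadth-first search around a seed entity. The sketch below assumes a plain adjacency-dictionary graph and hypothetical entity names; GraphGen's actual sampling strategy and graph storage are not described here.

```python
from collections import deque


def k_hop_neighborhood(adjacency, start, k):
    """Return {node: hop_distance} for every node within k hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = distance[node]
        if depth == k:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = depth + 1
                queue.append(neighbor)
    return distance


# Hypothetical toy graph: a chain of four entities.
toy_graph = {
    "gene_A": ["protein_B"],
    "protein_B": ["gene_A", "pathway_C"],
    "pathway_C": ["protein_B", "disease_D"],
    "disease_D": ["pathway_C"],
}
print(k_hop_neighborhood(toy_graph, "gene_A", 2))
```

With k set to 2, the search reaches pathway_C but not disease_D, so a question built from this neighborhood would link gene_A to pathway_C through protein_B.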

Key highlights

Supports a wide backend zoo: vLLM, SGLang, HuggingFace Transformers, Ollama, and standard HTTP APIs.
Ingests files, search results, Wikipedia, and scientific databases including NCBI, UniProt, and RNAcentral.
Outputs text and image QA formats, including chain-of-thought data synthesized via Leiden community detection on the graph.
Distributed pipeline built on Ray, with RocksDB for key-value storage and KuzuDB for graph queries.
Rephrase pipeline for pretraining that improves benchmarks with zero additional raw data, per the SlimPajama-6B evaluation table.
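As an illustration of how text QA output might feed a fine-tuning tool, the sketch below converts question-answer tuples into Alpaca-style records (instruction, input, output), one of the dataset layouts LLaMA-Factory accepts. GraphGen's own output schema is not described here, so the helper name and the sample pairs are assumptions.

```python
import json


def to_alpaca_records(qa_pairs):
    """Map (question, answer) tuples to Alpaca-style dictionaries."""
    return [
        {"instruction": question, "input": "", "output": answer}
        for question, answer in qa_pairs
    ]


# Hypothetical QA pairs standing in for generated data.
sample_pairs = [
    ("Which pathway links gene_A to disease_D?", "pathway_C"),
    ("What does gene_A encode?", "protein_B"),
]
print(json.dumps(to_alpaca_records(sample_pairs), indent=2, ensure_ascii=False))
```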

Caveats

Supervised fine-tuning (SFT) results are mixed: the GraphGen-tuned model trails the Qwen2.5-7B-Instruct baseline on general knowledge (CMMLU 73.6 vs. 75.8), so the approach appears more effective in specialized domains than on broad general-knowledge tasks.
The README notes that “over 50% SFT data” came from GraphGen, but it is unclear what the optimal blend is or whether those ratios transfer to other base models.
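Because the best blend is an open question, one practical approach is to treat the synthetic share as a tunable parameter and compare several values. The sketch below is a hypothetical helper for building such a mix; the function name, the seed handling, and the toy data are assumptions, and nothing here reflects how the README's mixture was actually produced.

```python
import random


def blend_datasets(synthetic, other, synthetic_fraction, total, seed=0):
    """Sample a shuffled mix of `total` examples with the given synthetic share."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    n_synthetic = round(total * synthetic_fraction)
    n_other = total - n_synthetic
    if n_synthetic > len(synthetic) or n_other > len(other):
        raise ValueError("not enough examples to reach the requested mix")

    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = rng.sample(synthetic, n_synthetic) + rng.sample(other, n_other)
    rng.shuffle(mixed)
    return mixed


# Toy data: tagged tuples standing in for real SFT examples.
synthetic_examples = [("synthetic", i) for i in range(100)]
other_examples = [("other", i) for i in range(100)]
mix = blend_datasets(synthetic_examples, other_examples, 0.5, 40)
print(sum(1 for source, _ in mix if source == "synthetic"), "of", len(mix), "are synthetic")
```

Sweeping synthetic_fraction over a few values and evaluating each resulting model is one way to check whether a given ratio carries over to a different base model.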

Verdict

A solid bet if you need structured, verifiable synthetic data for domain-specific fine-tuning or RLVR (reinforcement learning with verifiable rewards). Less compelling if you are tuning general-purpose chat models on broad conversational corpora.
