Is DataDesigner open source?

Yes — NVIDIA-NeMo/DataDesigner is open source, released under the Apache-2.0 license.

What language is DataDesigner written in?

NVIDIA-NeMo/DataDesigner is primarily written in Python.

How popular is DataDesigner?

NVIDIA-NeMo/DataDesigner has 2.1k stars on GitHub.

Where can I find DataDesigner?

NVIDIA-NeMo/DataDesigner is on GitHub at https://github.com/NVIDIA-NeMo/DataDesigner.

← all repositories

NVIDIA-NeMo/DataDesigner

NVIDIA's factory for fake data that doesn't feel fake

A Python framework that treats synthetic dataset generation like a proper data pipeline, not a chatbot prompt.

★2.1k stars Python Data Tooling Language Models LLMOps · Eval

View on GitHub ↗ Homepage ↗

Not currently ranked — collecting fresh signals.

star history

What it does

NeMo Data Designer generates synthetic datasets from scratch or seed data using a mix of statistical samplers, LLM calls, and validation rules. You define columns, dependencies between them, and quality checks — then it orchestrates the generation. Think of it as a data build tool where some steps are LLM prompts and others are categorical distributions.

The interesting bit

The async engine overlaps independent columns and adapts concurrency per provider-model pair, which is the kind of optimization that only matters once you’re burning through 2.6 trillion tokens. The framework also includes an agent skill for Claude Code that lets you describe a dataset in natural language and have the agent design the schema, validators, and generation logic.

Key highlights

Dependency-aware generation: columns can reference other columns via Jinja-style templating ({{ product_category }})
Built-in validators in Python, SQL, or custom remote endpoints
LLM-as-a-judge scoring for quality assessment
Preview mode for rapid iteration before full-scale runs
Supports NVIDIA Build API, OpenAI, and OpenRouter out of the box
CLI for provider and model configuration

Caveats

The async engine is now default but still transitional; the README warns you may need to fall back to DATA_DESIGNER_ASYNC_ENGINE=0 and file an issue
Documentation is mid-migration from MkDocs to Fern, so contributors should ignore generated artifacts and edit sources under docs/
Telemetry is on by default (model names and token counts only); opt-out requires setting NEMO_TELEMETRY_ENABLED=false
NVIDIA Build endpoint is explicitly marked “evaluation and testing only — not for production”

Verdict

Worth a look if you’re building training datasets at scale and need more structure than “ask ChatGPT 10,000 times.” Skip it if your synthetic data needs are ad-hoc or you can’t stomach another NVIDIA ecosystem dependency.

Frequently asked

What is NVIDIA-NeMo/DataDesigner?: A Python framework that treats synthetic dataset generation like a proper data pipeline, not a chatbot prompt.
Is DataDesigner open source?: Yes — NVIDIA-NeMo/DataDesigner is open source, released under the Apache-2.0 license.
What language is DataDesigner written in?: NVIDIA-NeMo/DataDesigner is primarily written in Python.
How popular is DataDesigner?: NVIDIA-NeMo/DataDesigner has 2.1k stars on GitHub.
Where can I find DataDesigner?: NVIDIA-NeMo/DataDesigner is on GitHub at https://github.com/NVIDIA-NeMo/DataDesigner.