NVIDIA's factory for fake data that doesn't feel fake
A Python framework that treats synthetic dataset generation like a proper data pipeline, not a chatbot prompt.

What it does
NeMo Data Designer generates synthetic datasets from scratch or seed data using a mix of statistical samplers, LLM calls, and validation rules. You define columns, dependencies between them, and quality checks — then it orchestrates the generation. Think of it as a data build tool where some steps are LLM prompts and others are categorical distributions.
The interesting bit
The async engine overlaps independent columns and adapts concurrency per provider-model pair, which is the kind of optimization that only matters once you’re burning through 2.6 trillion tokens. The framework also includes an agent skill for Claude Code that lets you describe a dataset in natural language and have the agent design the schema, validators, and generation logic.
Key highlights
- Dependency-aware generation: columns can reference other columns via Jinja-style templating (
{{ product_category }}) - Built-in validators in Python, SQL, or custom remote endpoints
- LLM-as-a-judge scoring for quality assessment
- Preview mode for rapid iteration before full-scale runs
- Supports NVIDIA Build API, OpenAI, and OpenRouter out of the box
- CLI for provider and model configuration
Caveats
- The async engine is now default but still transitional; the README warns you may need to fall back to
DATA_DESIGNER_ASYNC_ENGINE=0and file an issue - Documentation is mid-migration from MkDocs to Fern, so contributors should ignore generated artifacts and edit sources under
docs/ - Telemetry is on by default (model names and token counts only); opt-out requires setting
NEMO_TELEMETRY_ENABLED=false - NVIDIA Build endpoint is explicitly marked “evaluation and testing only — not for production”
Verdict
Worth a look if you’re building training datasets at scale and need more structure than “ask ChatGPT 10,000 times.” Skip it if your synthetic data needs are ad-hoc or you can’t stomach another NVIDIA ecosystem dependency.