NVIDIA-NeMo/DataDesigner

NVIDIA's factory for fake data that doesn't feel fake

A Python framework that treats synthetic dataset generation like a proper data pipeline, not a chatbot prompt.

Velocity (7d): +8.4 ★/day · Trend: steady

What it does

NeMo Data Designer generates synthetic datasets from scratch or from seed data using a mix of statistical samplers, LLM calls, and validation rules. You define columns, the dependencies between them, and quality checks; the framework then orchestrates generation. Think of it as a data build tool where some steps are LLM prompts and others are categorical distributions.
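To make the "sampler, then LLM, then validator" idea concrete, here is a minimal sketch in plain Python. It does not use the Data Designer API: `fake_llm`, `generate_row`, `is_valid`, and `preview` are hypothetical names invented for illustration, and the LLM call is stubbed out.

```python
import random

# Hypothetical stand-in for an LLM call; a real pipeline would query a model here.
def fake_llm(prompt: str) -> str:
    return f"[generated text for: {prompt}]"

# Assumed example distribution, not taken from the project.
CATEGORIES = ["electronics", "books", "toys"]
WEIGHTS = [0.5, 0.3, 0.2]

def generate_row(rng: random.Random) -> dict:
    # Step 1: statistical sampler (a categorical distribution).
    category = rng.choices(CATEGORIES, weights=WEIGHTS, k=1)[0]
    # Step 2: an LLM-style column that depends on the sampled value.
    review = fake_llm(f"Write a short review of a product in {category}.")
    return {"product_category": category, "review": review}

def is_valid(row: dict) -> bool:
    # Step 3: a validation rule applied to every generated row.
    return row["product_category"] in CATEGORIES and len(row["review"]) > 0

def preview(n: int, seed: int = 0) -> list[dict]:
    # Generate a small batch and keep only rows that pass validation.
    rng = random.Random(seed)
    rows = [generate_row(rng) for _ in range(n)]
    return [row for row in rows if is_valid(row)]
```

Calling `preview(5)` returns up to five validated rows, which roughly mirrors the idea of checking a handful of records before committing to a full run.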

The interesting bit

The async engine overlaps independent columns and adapts concurrency per provider-model pair, which is the kind of optimization that only matters once you’re burning through 2.6 trillion tokens. The framework also includes an agent skill for Claude Code that lets you describe a dataset in natural language and have the agent design the schema, validators, and generation logic.
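A rough sketch of the overlap idea, using only `asyncio` from the standard library. This is not the project's engine: the limits table, the provider and model names, and every function here are assumptions, and the limits are fixed rather than adapted at runtime as the real engine is described as doing.

```python
import asyncio

# Hypothetical per-(provider, model) concurrency caps; not real Data Designer settings.
LIMITS = {("openai", "model-a"): 2, ("nvidia", "model-b"): 1}

async def call_model(key, prompt, semaphores):
    # Each provider-model pair has its own semaphore, so a slow or
    # rate-limited pair does not throttle the others.
    async with semaphores[key]:
        await asyncio.sleep(0.01)  # placeholder for network latency
        return f"{key[1]}:{prompt}"

async def generate_column(name, key, n_rows, semaphores):
    # All rows of one column are scheduled at once; the semaphore bounds
    # how many requests are actually in flight.
    tasks = [call_model(key, f"{name}[{i}]", semaphores) for i in range(n_rows)]
    return name, await asyncio.gather(*tasks)

async def run(n_rows=4):
    semaphores = {key: asyncio.Semaphore(cap) for key, cap in LIMITS.items()}
    # The two columns do not depend on each other, so they run concurrently.
    results = await asyncio.gather(
        generate_column("title", ("openai", "model-a"), n_rows, semaphores),
        generate_column("summary", ("nvidia", "model-b"), n_rows, semaphores),
    )
    return dict(results)
```

Run it with `asyncio.run(run())`. An adaptive version would adjust each cap in response to rate-limit errors or latency, which this sketch leaves out.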

Key highlights

  • Dependency-aware generation: columns can reference other columns via Jinja-style templating ({{ product_category }})
  • Built-in validators in Python, SQL, or custom remote endpoints
  • LLM-as-a-judge scoring for quality assessment
  • Preview mode for rapid iteration before full-scale runs
  • Supports NVIDIA Build API, OpenAI, and OpenRouter out of the box
  • CLI for provider and model configuration
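The dependency-aware highlight can be illustrated with a small sketch: read `{{ column }}` references out of each template, then order columns so that every column is generated after the ones it mentions. This uses a regular expression and the standard library's `graphlib` rather than Jinja or the framework itself, and the function names are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Matches simple placeholders such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def referenced_columns(template: str) -> set[str]:
    # Columns this template depends on.
    return set(PLACEHOLDER.findall(template))

def generation_order(templates: dict[str, str]) -> list[str]:
    # Map each column to its predecessors, then sort so dependencies come first.
    graph = {name: referenced_columns(text) for name, text in templates.items()}
    return list(TopologicalSorter(graph).static_order())

def render(template: str, row: dict[str, str]) -> str:
    # Substitute already-generated values into the template.
    return PLACEHOLDER.sub(lambda match: row[match.group(1)], template)
```

With templates for `product_category` (no references), `product_name` (references the category), and `review` (references both), `generation_order` yields the category first, then the name, then the review. A circular reference makes `TopologicalSorter` raise `graphlib.CycleError`.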

Caveats

  • The async engine is now default but still transitional; the README warns you may need to fall back to DATA_DESIGNER_ASYNC_ENGINE=0 and file an issue
  • Documentation is mid-migration from MkDocs to Fern, so contributors should ignore generated artifacts and edit sources under docs/
  • Telemetry is on by default (model names and token counts only); opt-out requires setting NEMO_TELEMETRY_ENABLED=false
  • NVIDIA Build endpoint is explicitly marked “evaluation and testing only — not for production”
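For the two environment-variable caveats, a minimal sketch of setting them from Python. The variable names and values come from the list above; that the framework reads them when it is imported or run, rather than at some earlier point, is an assumption.

```python
import os

# Fall back from the async engine (value taken from the caveat above).
os.environ["DATA_DESIGNER_ASYNC_ENGINE"] = "0"

# Opt out of telemetry (value taken from the caveat above).
os.environ["NEMO_TELEMETRY_ENABLED"] = "false"

# Assumption: import and use the framework only after this point,
# so that the settings are visible to it.
```

Exporting the same variables in the shell before launching Python is the more conventional route and avoids any import-order concerns.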

Verdict

Worth a look if you’re building training datasets at scale and need more structure than “ask ChatGPT 10,000 times.” Skip it if your synthetic data needs are ad-hoc or you can’t stomach another NVIDIA ecosystem dependency.
