NVIDIA's motion diffusion model actually ships with a timeline editor
Kimodo generates 3D human and robot motion from text prompts plus precise kinematic constraints—keyframes, end-effector positions, 2D paths—rather than hoping the model guesses your intent.

What it does
Kimodo is a diffusion model trained on 700 hours of commercially friendly motion capture. It generates 3D motion for human and robot skeletons (SOMA, Unitree G1, SMPL-X) controlled by text prompts and an unusually broad set of kinematic constraints: full-body pose keyframes, end-effector positions/rotations, 2D root paths, and waypoints. The repo includes inference code, a CLI, a web-based interactive demo with timeline editing, and a benchmark suite built on the BONES-SEED dataset.
The interesting bit
Most motion generation tools treat text as the only steering wheel. Kimodo adds a full constraint stack (pose keyframes, hand/foot targets, ground-plane paths) and exposes it through a Gradio-like web demo where you author motions on a multi-track timeline. The model weights also download automatically from Hugging Face, so you don’t have to wrestle with them manually.
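To make the constraint stack concrete, here is a minimal sketch of how a multi-track timeline of constraints might be represented in Python. Every name in it (Constraint, Timeline, the kind strings, the track labels) is hypothetical and chosen for illustration; none of it is Kimodo's actual API or file format.

```python
from dataclasses import dataclass, field

# Hypothetical constraint kinds mirroring the ones the article lists.
KINDS = {"pose_keyframe", "end_effector", "root_path", "waypoint"}


@dataclass
class Constraint:
    kind: str                   # one of KINDS
    start_frame: int            # first frame the constraint applies to
    end_frame: int              # last frame, inclusive (same as start for a keyframe)
    target: str = "full_body"   # illustrative label, e.g. "right_hand"

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown constraint kind: {self.kind}")
        if self.end_frame < self.start_frame:
            raise ValueError("end_frame must be >= start_frame")


@dataclass
class Timeline:
    prompt: str
    num_frames: int
    tracks: dict = field(default_factory=dict)  # track name -> list of Constraint

    def add(self, track: str, constraint: Constraint) -> None:
        if constraint.end_frame >= self.num_frames:
            raise ValueError("constraint extends past the end of the clip")
        self.tracks.setdefault(track, []).append(constraint)

    def active_at(self, frame: int) -> list:
        """Return every constraint, on any track, that covers the given frame."""
        return [
            c
            for constraints in self.tracks.values()
            for c in constraints
            if c.start_frame <= frame <= c.end_frame
        ]


# A 120-frame clip: follow a root path, hit a pose at frame 60, then pin a hand.
timeline = Timeline(prompt="walk forward, then wave", num_frames=120)
timeline.add("root", Constraint("root_path", 0, 89, target="root"))
timeline.add("hands", Constraint("end_effector", 90, 119, target="right_hand"))
timeline.add("poses", Constraint("pose_keyframe", 60, 60))
```

Keeping each constraint type on its own track is what lets an editor show overlapping constraints side by side, and `active_at` is the kind of query a preview would need when scrubbing to a frame.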
Key highlights
- Ships with six model variants across three skeletons (SOMA 77-joint, G1, SMPL-X), with RP models trained on 700h mocap recommended over the 288h SEED variants
- Interactive demo runs locally at 127.0.0.1:7860 with real-time 3D preview, constraint editing, and export to NPZ, MuJoCo CSV, and AMASS formats
- CLI supports classifier-free guidance with separate weights for text vs. constraints, plus optional foot-skate cleanup post-processing
- VRAM requirement drops from ~17 GB to <3 GB by offloading text encoding to CPU via TEXT_ENCODER_DEVICE=cpu
- Includes a Motion Generation Benchmark with test cases and evaluation code for comparing constraint-following accuracy across models
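The separate guidance weights for text and constraints can be illustrated with a short Python sketch. It uses one common way of combining two conditions in classifier-free guidance, where each condition adds its own weighted offset from the unconditional prediction. This formulation is an assumption made for illustration; the source does not state Kimodo's exact formula, and the function and parameter names are hypothetical.

```python
def combine_guidance(pred_uncond, pred_text, pred_constraint,
                     text_weight, constraint_weight):
    """Blend three denoiser predictions using separate guidance weights.

    Assumed formulation (not confirmed as Kimodo's), applied per element:
        out = uncond
              + text_weight * (text - uncond)
              + constraint_weight * (constraint - uncond)
    """
    return [
        u + text_weight * (t - u) + constraint_weight * (c - u)
        for u, t, c in zip(pred_uncond, pred_text, pred_constraint)
    ]


# Toy predictions for a single 3-value motion feature.
uncond = [0.0, 0.0, 0.0]
text = [1.0, 0.0, 0.0]        # prediction conditioned on text only
constraint = [0.0, 2.0, 0.0]  # prediction conditioned on constraints only

guided = combine_guidance(uncond, text, constraint,
                          text_weight=2.0, constraint_weight=1.5)
```

Under this formulation, setting both weights to zero returns the unconditional prediction, and raising one weight pushes the result further toward that condition without touching the other.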
Caveats
- Developed on Linux; Windows support exists but is less tested (Docker recommended)
- SMPL-X variant carries a stricter NVIDIA R&D Model license, unlike the Open Model license for SOMA and G1 variants
- A March 2026 breaking change switched SOMA models to a 77-joint skeleton (somaskel77), so older integrations may need updating
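For integrations written before the skeleton switch, a cheap guard is to check the joint count of incoming motion data before using it. The sketch below assumes motion is stored as nested sequences shaped (frames, joints, 3); that layout, the function name, and the error message are illustrative assumptions, and only the 77-joint figure comes from the source.

```python
EXPECTED_SOMA_JOINTS = 77  # joint count of the current SOMA skeleton, per the article


def check_joint_count(motion, expected=EXPECTED_SOMA_JOINTS):
    """Return the frame count, or raise ValueError on a joint-count mismatch.

    `motion` is assumed to be a sequence of frames, each a sequence of
    per-joint (x, y, z) positions. This layout is hypothetical.
    """
    for index, frame in enumerate(motion):
        if len(frame) != expected:
            raise ValueError(
                f"frame {index} has {len(frame)} joints, expected {expected}; "
                "the data may predate the 77-joint skeleton"
            )
    return len(motion)


# A dummy 4-frame clip with 77 joints per frame passes the check.
clip = [[(0.0, 0.0, 0.0)] * 77 for _ in range(4)]
frames_ok = check_joint_count(clip)
```

Failing loudly at load time is usually preferable to letting a mismatched skeleton produce silently wrong retargeting further down the pipeline.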
Verdict
Worth a look if you’re building animation pipelines, robotics simulators, or game tools where artists need precise control over generated motion, not just a lucky text prompt. Skip if you need real-time generation at runtime or lightweight CPU-only inference; this is still research-grade diffusion with a GPU appetite.