NVIDIA/cosmos

NVIDIA's omnimodal world model: one transformer, two personalities

Cosmos 3 tries to unify video generation, robot action prediction, and physical reasoning inside a single 16B–64B Mixture-of-Transformers architecture.

Velocity (7d): +18 ★/day · Trend: steady

What it does

Cosmos 3 is a family of world models that can either reason about physical scenes (captioning, temporal localization, next-action prediction) or generate multimodal outputs (images, video, synchronized sound, robot action trajectories). It runs in two modes: Reasoner for text outputs from vision+language inputs, and Generator for producing video, audio, and action sequences conditioned on text, images, video, or action arrays. NVIDIA provides checkpoints from 16B to 64B parameters, including task-specialized variants for text-to-image, image-to-video, and DROID robot policy learning.

The interesting bit

The architecture shares a single Mixture-of-Transformers backbone between autoregressive reasoning and diffusion-based generation, using the same 3D rotary position embeddings across modalities. In theory, that means the same model weights handle both “describe this robot video” and “predict how this robot video continues”, two tasks that most projects split across entirely separate models.
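To make the "3D rotary position embeddings" idea concrete, here is a minimal pure-Python sketch of one common way to build them: the channels of a vector are split into three groups, and each group is rotated according to one coordinate (time, height, width). The channel split, the frequency base, and all function names are assumptions for illustration. The summary does not say how Cosmos 3 divides channels or which coordinates it assigns to text or action tokens.

```python
import math


def rope_angles(pos, dim, base=10000.0):
    """Rotation angles for one axis; `dim` must be even and yields dim // 2 angles."""
    return [pos / (base ** (2 * i / dim)) for i in range(dim // 2)]


def rotate_pairs(vec, angles):
    """Rotate consecutive (even, odd) channel pairs of `vec` by the given angles."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_3d(vec, t, h, w, split=(4, 2, 2)):
    """Apply rotary embedding with separate channel groups for time, height, width.

    `split` (an assumed layout) gives the number of channels per axis.
    """
    assert sum(split) == len(vec) and all(d % 2 == 0 for d in split)
    out, start = [], 0
    for pos, dim in zip((t, h, w), split):
        out.extend(rotate_pairs(vec[start:start + dim], rope_angles(pos, dim)))
        start += dim
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 0.8, 0.2, -0.7, 0.9, 0.1]
k = [0.4, -0.6, 0.7, 0.3, -0.5, 0.2, 0.1, 0.8]

# Both pairs of positions differ by the same offset (1, 1, 2),
# so the two attention scores agree up to floating-point error.
score_a = dot(rope_3d(q, 1, 2, 3), rope_3d(k, 0, 1, 1))
score_b = dot(rope_3d(q, 5, 7, 9), rope_3d(k, 4, 6, 7))
print(abs(score_a - score_b) < 1e-9)
```

The property worth noticing is that the score depends only on the relative offset between two tokens along each axis, which is what lets a single position scheme serve tokens laid out on a (time, height, width) grid.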

Key highlights

  • Dual runtime surfaces: Reasoner (text out) and Generator (vision/sound/action out) with different attention patterns but shared architecture
  • Broad action conditioning: Supports camera motion (9D), autonomous vehicle (9D), egocentric motion (57D), single-arm robots like DROID (10D), dual-arm (20D), and humanoid AgiBot (29D)
  • Multiple serving paths: Diffusers/Transformers for research, vLLM-Omni and vLLM for OpenAI-compatible production serving
  • Generation controls: Resolutions from 256p to 720p, frame rates 10–30 FPS, up to 300 frames, multiple aspect ratios
  • Prompt upsampling built in: Short descriptions get expanded into dense structured prompts automatically
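The action dimensions and generation limits listed above can be read as a small set of constraints on a request. The following sketch checks a request against them before it would be sent anywhere. The dictionary keys, the `check_request` function, and the treatment of resolution as a continuous 256 to 720 range are hypothetical; they are not identifiers or rules taken from the repository, and the summary does not list which intermediate resolutions are supported.

```python
# Per-step action vector sizes as listed in the highlights above.
# The keys are hypothetical labels, not identifiers from the repository.
ACTION_DIMS = {
    "camera_motion": 9,
    "autonomous_vehicle": 9,
    "egocentric_motion": 57,
    "single_arm": 10,  # e.g. DROID
    "dual_arm": 20,
    "humanoid": 29,    # e.g. AgiBot
}

RESOLUTION_RANGE = (256, 720)  # "256p to 720p"; intermediate steps are an assumption
FPS_RANGE = (10, 30)
MAX_FRAMES = 300


def check_request(embodiment, actions, resolution_p, fps, num_frames):
    """Return a list of problems; an empty list means the request fits the stated limits."""
    problems = []

    expected = ACTION_DIMS.get(embodiment)
    if expected is None:
        problems.append(f"unknown embodiment {embodiment!r}")
    else:
        # Report only the first step whose vector has the wrong length.
        for step, vec in enumerate(actions):
            if len(vec) != expected:
                problems.append(
                    f"step {step}: expected {expected} values, got {len(vec)}"
                )
                break

    if not RESOLUTION_RANGE[0] <= resolution_p <= RESOLUTION_RANGE[1]:
        problems.append(f"resolution {resolution_p}p outside {RESOLUTION_RANGE}")
    if not FPS_RANGE[0] <= fps <= FPS_RANGE[1]:
        problems.append(f"fps {fps} outside {FPS_RANGE}")
    if not 1 <= num_frames <= MAX_FRAMES:
        problems.append(f"num_frames {num_frames} outside 1..{MAX_FRAMES}")

    return problems


# A 16-step single-arm trajectory with 10 values per step.
trajectory = [[0.0] * 10 for _ in range(16)]
print(check_request("single_arm", trajectory, 480, 24, 120))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to show every violated limit at once when assembling a conditioning array by hand.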

Caveats

  • Linux and NVIDIA Ampere/Hopper/Blackwell only; no Windows or older GPU support mentioned
  • BF16 precision tested, but no FP8 or FP16 guidance visible
  • Post-training recipes and Cosmos Framework adaptation workflows marked “[Coming Soon]”
  • “Seamlessly unifies” is NVIDIA’s phrasing — the README doesn’t quantify how much the shared backbone actually helps versus separate specialist models

Verdict

Worth exploring if you’re building physical AI (robotics, AV simulation, synthetic training data) and want one model family that can both interpret and generate worlds. Skip if you need lightweight CPU inference or non-NVIDIA hardware — this is firmly a datacenter-GPU platform.
