One image in, explorable 3D world out — via video diffusion
DimensionX turns a single photo into controllable camera videos, then extracts actual 3D/4D scenes by treating space and time as separate knobs.

What it does DimensionX starts with one image and generates videos with precise camera control — orbit left, pan up, dolly in, etc. — using fine-tuned LoRA adapters on top of CogVideoX. From those generated frame sequences, it reconstructs explorable 3D scenes (via Gaussian Splatting) and 4D scenes (time-varying 3D) by treating spatial motion and temporal dynamics as independently controllable factors.
The interesting bit The authors’ “ST-Director” decouples space and time in video diffusion instead of letting the model freely mix them. This is the boring-sounding trick that makes the whole pipeline work: by forcing the model to learn separate “directors” for camera motion and scene dynamics, they can recombine them to recover geometry that actually holds up under novel viewpoints.
Key highlights
- 12 camera LoRAs covering 6 degrees of freedom (± each axis), plus 4 orbital presets
- 360° orbit model generates 145 frames (~18s) in ~6 min on a single A6000 (30.5 GB VRAM)
- Standard LoRA inference: ~3 min on A100/A800 for 48-frame, 6-second clips (26.3 GB VRAM)
- Full training pipeline, datasets, and 3DGS optimization code open-sourced as of Oct 2025
- Built on CogVideoX-5B-I2V; authors note HunyuanVideo and WanX as alternative bases
Caveats
- 4D generation’s “identity-preserving denoising code” is still on the todo list; not yet released
- README setup involves multiple conda envs, manual checkpoint shuffling, and SAT/THUDM-specific weight formats
- Inference “better result” note suggests running a VLM captioning pass first — not quite one-click
Verdict Worth a look if you’re doing neural rendering, single-view reconstruction, or controllable video generation research. Skip if you need polished 4D output today or lack the GPU headroom for 30+ GB models.