← all repositories
wenqsun/DimensionX

One image in, explorable 3D world out — via video diffusion

DimensionX turns a single photo into controllable camera videos, then extracts actual 3D/4D scenes by treating space and time as separate knobs.

DimensionX
Velocity · 7d
+2.3
★ / day
Trend
steady
star history

What it does DimensionX starts with one image and generates videos with precise camera control — orbit left, pan up, dolly in, etc. — using fine-tuned LoRA adapters on top of CogVideoX. From those generated frame sequences, it reconstructs explorable 3D scenes (via Gaussian Splatting) and 4D scenes (time-varying 3D) by treating spatial motion and temporal dynamics as independently controllable factors.

The interesting bit The authors’ “ST-Director” decouples space and time in video diffusion instead of letting the model freely mix them. This is the boring-sounding trick that makes the whole pipeline work: by forcing the model to learn separate “directors” for camera motion and scene dynamics, they can recombine them to recover geometry that actually holds up under novel viewpoints.

Key highlights

  • 12 camera LoRAs covering 6 degrees of freedom (± each axis), plus 4 orbital presets
  • 360° orbit model generates 145 frames (~18s) in ~6 min on a single A6000 (30.5 GB VRAM)
  • Standard LoRA inference: ~3 min on A100/A800 for 48-frame, 6-second clips (26.3 GB VRAM)
  • Full training pipeline, datasets, and 3DGS optimization code open-sourced as of Oct 2025
  • Built on CogVideoX-5B-I2V; authors note HunyuanVideo and WanX as alternative bases

Caveats

  • 4D generation’s “identity-preserving denoising code” is still on the todo list; not yet released
  • README setup involves multiple conda envs, manual checkpoint shuffling, and SAT/THUDM-specific weight formats
  • Inference “better result” note suggests running a VLM captioning pass first — not quite one-click

Verdict Worth a look if you’re doing neural rendering, single-view reconstruction, or controllable video generation research. Skip if you need polished 4D output today or lack the GPU headroom for 30+ GB models.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.