A transformer that looks at photos and builds 3D scenes in under a second
VGGT turns one image—or a hundred—into camera poses, depth maps, point clouds, and trackable 3D points without any optimization loop.

What it does
VGGT is a feed-forward vision transformer that reconstructs 3D scene geometry from unstructured images. Feed it one photo, a handful, or hundreds; it spits out camera intrinsics and extrinsics, depth maps, point maps, and 3D point tracks. No COLMAP-style iterative optimization, no painstaking feature matching. The authors claim inference runs in under a second, though rendering the results afterward can drag.
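To make the relationship between those outputs concrete, here is a minimal sketch in plain Python of how a depth map plus camera intrinsics can be turned into a camera-frame point map. It is not VGGT's code: the function name `unproject_depth` and the tiny 2x2 depth map are hypothetical, and it assumes an ideal pinhole camera with no lens distortion.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def unproject_depth(depth: List[List[float]], fx: float, fy: float,
                    cx: float, cy: float) -> List[List[Point]]:
    """Back-project every pixel (u, v) with depth z into the camera frame.

    Assumes a pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    point_map = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        point_map.append(out_row)
    return point_map


# Hypothetical 2x2 depth map (rows are image rows, values are depths).
depth = [[2.0, 2.0], [4.0, 4.0]]
points = unproject_depth(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Moving these points into a shared world frame would additionally require applying the inverse of each camera's extrinsic transform.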
The interesting bit
The model was never trained on single-view reconstruction, yet it handles it zero-shot—no image duplication tricks, just direct inference from one view’s tokens. It also exports straight to COLMAP format, so you can pipe its output into Gaussian splatting pipelines like gsplat without touching a traditional SfM pipeline yourself.
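For readers unfamiliar with what "COLMAP format" looks like on disk, the sketch below formats one camera and one image pose the way COLMAP's text files (`cameras.txt` and `images.txt`) lay them out. It is an illustration only, not VGGT's exporter: the helper names, image size, and focal length are hypothetical, and it assumes a PINHOLE camera and a world-to-camera rotation and translation.

```python
import math
from typing import Sequence, Tuple


def rotation_to_quaternion(R: Sequence[Sequence[float]]) -> Tuple[float, float, float, float]:
    """Convert a 3x3 rotation matrix to a unit quaternion (qw, qx, qy, qz)."""
    trace = R[0][0] + R[1][1] + R[2][2]
    if trace > 0:
        s = 2.0 * math.sqrt(trace + 1.0)
        return (0.25 * s, (R[2][1] - R[1][2]) / s,
                (R[0][2] - R[2][0]) / s, (R[1][0] - R[0][1]) / s)
    if R[0][0] > R[1][1] and R[0][0] > R[2][2]:
        s = 2.0 * math.sqrt(1.0 + R[0][0] - R[1][1] - R[2][2])
        return ((R[2][1] - R[1][2]) / s, 0.25 * s,
                (R[0][1] + R[1][0]) / s, (R[0][2] + R[2][0]) / s)
    if R[1][1] > R[2][2]:
        s = 2.0 * math.sqrt(1.0 + R[1][1] - R[0][0] - R[2][2])
        return ((R[0][2] - R[2][0]) / s, (R[0][1] + R[1][0]) / s,
                0.25 * s, (R[1][2] + R[2][1]) / s)
    s = 2.0 * math.sqrt(1.0 + R[2][2] - R[0][0] - R[1][1])
    return ((R[1][0] - R[0][1]) / s, (R[0][2] + R[2][0]) / s,
            (R[1][2] + R[2][1]) / s, 0.25 * s)


def colmap_camera_line(camera_id: int, width: int, height: int,
                       fx: float, fy: float, cx: float, cy: float) -> str:
    """cameras.txt row: CAMERA_ID MODEL WIDTH HEIGHT PARAMS[] (PINHOLE: fx fy cx cy)."""
    return f"{camera_id} PINHOLE {width} {height} {fx} {fy} {cx} {cy}"


def colmap_image_line(image_id: int, R: Sequence[Sequence[float]],
                      t: Sequence[float], camera_id: int, name: str) -> str:
    """images.txt row: IMAGE_ID QW QX QY QZ TX TY TZ CAMERA_ID NAME.

    In the real file each such row is followed by a second line of 2D points,
    which may be left empty.
    """
    qw, qx, qy, qz = rotation_to_quaternion(R)
    return f"{image_id} {qw} {qx} {qy} {qz} {t[0]} {t[1]} {t[2]} {camera_id} {name}"


# Hypothetical values: identity rotation, camera one unit along +z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
camera_line = colmap_camera_line(1, 518, 392, 500.0, 500.0, 259.0, 196.0)
image_line = colmap_image_line(1, identity, (0.0, 0.0, 1.0), 1, "frame_000.png")
```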
Key highlights
- One forward pass yields cameras, depth, point maps, and point tracks
- Supports masking unwanted pixels (sky, reflections) with coarse 0/1 masks—no precise segmentation needed
- Includes Gradio web demo and Viser 3D viewer for interactive exploration
- Training code and fine-tuning examples released in the training folder
- Commercial-use checkpoint available via application (LLaMA-style approval), though original checkpoint remains non-commercial
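As an illustration of how coarse such a mask can be, the sketch below blanks a rectangular region of an image with a constant value. It is a plain-Python toy, not VGGT's interface: the `blank_box` helper is hypothetical, the image is a nested list of grayscale values, and it assumes the mask is applied by overwriting input pixels before inference.

```python
from typing import List

Image = List[List[float]]  # grayscale image, values in [0, 1]


def blank_box(image: Image, top: int, left: int, bottom: int, right: int,
              fill: float = 0.0) -> Image:
    """Return a copy of image with rows [top, bottom) x cols [left, right) set to fill.

    Box coordinates are clamped to the image bounds; the input is not modified.
    """
    out = [row[:] for row in image]
    for r in range(max(top, 0), min(bottom, len(out))):
        for c in range(max(left, 0), min(right, len(out[r]))):
            out[r][c] = fill
    return out


# Hypothetical 3x4 image; blank the top row, e.g. a strip of sky.
image = [[0.5] * 4 for _ in range(3)]
masked = blank_box(image, top=0, left=0, bottom=1, right=4)
```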
Caveats
- Requires a beefy GPU; bfloat16 is only supported on Ampere or newer GPUs (Compute Capability 8.0+), so older cards need to fall back to float16
- The “under 1 second” claim is for inference only—visualization can take tens of seconds, especially with many images
- Bundle adjustment is optional but recommended; running it with reduced parameters is faster but can be less robust on complex scenes
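The precision caveat boils down to a one-line check on the GPU's compute capability. The sketch below shows that decision as a pure function; the name `pick_autocast_dtype` is hypothetical, and float16 as the fallback for pre-Ampere cards is an assumption. In PyTorch, the major version would be the first element of the tuple returned by `torch.cuda.get_device_capability()`.

```python
def pick_autocast_dtype(compute_capability_major: int) -> str:
    """Choose a mixed-precision dtype name from the GPU's major compute capability.

    Ampere and newer (major >= 8) support bfloat16; older GPUs are assumed
    to fall back to float16.
    """
    return "bfloat16" if compute_capability_major >= 8 else "float16"


ampere_dtype = pick_autocast_dtype(8)   # e.g. an Ampere-class GPU
turing_dtype = pick_autocast_dtype(7)   # e.g. a Volta- or Turing-class GPU
```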
Verdict
Worth a look if you’re building 3D pipelines and want to skip the COLMAP dance. Skip it if you’re on CPU-only hardware or need guaranteed metric-scale accuracy without calibration.