NVlabs/VoxFormer

Filling in the occluded bits, no LiDAR required

VoxFormer generates dense 3D semantic scenes from plain 2D images by predicting the geometry and labels of space the camera never saw.

★ 1.2k stars · Python · Computer Vision · Domain Apps

What it does

VoxFormer takes one or more RGB images and outputs a dense 3D semantic occupancy grid, inferring the occupancy and class label of every voxel, including those hidden behind other objects. It runs a two-stage transformer pipeline built on ResNet50 image features and an off-the-shelf depth predictor. The code currently targets the SemanticKITTI benchmark, where it ranked first among camera-only methods at the time of publication.

The interesting bit

Instead of naïvely projecting every 2D feature into 3D space and hoping for the best, the model starts with a sparse set of voxel queries that correspond only to visible, occupied surfaces derived from estimated depth. It then uses a masked-autoencoder-style self-attention stage to propagate those reliable anchors into the occluded void. The insight is straightforward: 2D pixels only carry trustworthy information about what is actually visible, so anchor there first, then densify.
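The anchor-then-densify idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's code: the function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the grid layout with its corner at the camera origin, and the `"QUERY"` / `"MASK"` placeholders are all assumptions made for the example, and the learned depth-correction and attention layers are left out entirely.

```python
import math
from typing import Dict, Sequence, Set, Tuple

Voxel = Tuple[int, int, int]


def propose_voxel_queries(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    voxel_size: float,
    grid_shape: Tuple[int, int, int],
) -> Set[Voxel]:
    """Return indices of voxels hit by at least one back-projected pixel.

    Hypothetical helper: assumes a pinhole camera and a grid whose corner
    sits at the camera origin, which is a simplification for illustration.
    """
    queries: Set[Voxel] = set()
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip pixels with no valid depth estimate
                continue
            # Pinhole back-projection from pixel (u, v) and depth z to 3D.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            idx = (
                math.floor(x / voxel_size),
                math.floor(y / voxel_size),
                math.floor(z / voxel_size),
            )
            # Keep only points that land inside the grid.
            if all(0 <= i < n for i, n in zip(idx, grid_shape)):
                queries.add(idx)
    return queries


def init_dense_tokens(
    queries: Set[Voxel],
    grid_shape: Tuple[int, int, int],
    mask_token: str = "MASK",
) -> Dict[Voxel, str]:
    """Give visible voxels a query token and all other voxels a shared mask token."""
    nx, ny, nz = grid_shape
    return {
        (i, j, k): ("QUERY" if (i, j, k) in queries else mask_token)
        for i in range(nx)
        for j in range(ny)
        for k in range(nz)
    }


# Toy 2x2 depth map; 0.0 marks a pixel with no depth estimate.
depth_map = [[1.0, 1.0], [0.0, 2.0]]
queries = propose_voxel_queries(
    depth_map, fx=1.0, fy=1.0, cx=0.0, cy=0.0,
    voxel_size=1.0, grid_shape=(4, 4, 4),
)
tokens = init_dense_tokens(queries, grid_shape=(4, 4, 4))
```

In this toy run, three of the 64 voxels become queries and the remaining 61 start as mask tokens. A model in the spirit of the description above would then use attention to fill in those masked positions from the visible anchors.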

Key highlights

Ranked #1 on camera-only 3D semantic scene completion for SemanticKITTI at the time of publication, hitting 44.15% IoU and 13.35% mIoU.
Relative improvements of 20.0% in geometry and 18.1% in semantics over prior state of the art, while cutting training GPU memory by roughly 45% to under 16 GB.
Two-stage design: class-agnostic sparse query proposal via depth-corrected occupancy, followed by deformable self-attention to complete the dense volume.
Supports both monocular and multi-view inputs; the July 2023 update added 3D deformable attention for slightly better performance.
Non-commercial NVIDIA license (NC) with pretrained weights available under CC-BY-NC-SA-4.0.
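
For readers unfamiliar with the two headline metrics, the sketch below shows how they are commonly computed on a voxel grid: IoU scores geometry (occupied versus empty, ignoring class), while mIoU averages per-class IoU over the semantic classes. The label convention (0 for empty), the function names, and the toy labels are assumptions for illustration. The benchmark's own evaluation script handles details that are omitted here, such as ignored voxels and classes absent from a scene. A relative-improvement helper is included because the highlights quote gains in that form.

```python
from typing import Sequence

EMPTY = 0  # assumed label for free space


def geometry_iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Class-agnostic IoU: a voxel counts as occupied if its label is not EMPTY."""
    inter = sum(1 for p, t in zip(pred, target) if p != EMPTY and t != EMPTY)
    union = sum(1 for p, t in zip(pred, target) if p != EMPTY or t != EMPTY)
    return inter / union if union else 0.0


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average per-class IoU over semantic classes 1..num_classes-1."""
    ious = []
    for c in range(1, num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # simplification: skip classes absent from both inputs
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def relative_improvement(new: float, old: float) -> float:
    """Fractional gain of `new` over `old`, e.g. 0.2 means 20%."""
    return (new - old) / old


# Toy flattened grids with labels 0 (empty), 1 and 2.
pred = [0, 1, 1, 2, 0, 2]
target = [0, 1, 2, 2, 1, 0]
geo = geometry_iou(pred, target)
sem = mean_iou(pred, target, num_classes=3)
```

On the toy labels this gives a geometry IoU of 0.6 and an mIoU of 1/3, which shows why the semantic score is typically much lower than the geometric one: a voxel can be correctly marked as occupied while carrying the wrong class.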

Caveats

Dataset support is currently limited to SemanticKITTI; promised KITTI-360 and nuScenes integrations remain unchecked on the roadmap.
The codebase is explicitly non-commercial, which limits deployment in most products without a separate license from NVIDIA.

Verdict

Worth a look if you research camera-only 3D perception or occupancy prediction for autonomous driving. Skip it if you need a production-ready, commercially licensed drop-in for multi-dataset fleets.

Frequently asked

What is NVlabs/VoxFormer? VoxFormer generates dense 3D semantic scenes from plain 2D images by predicting the geometry and labels of space the camera never saw.
Is VoxFormer open source? The source code is public on GitHub and tracked on heatdrop, but it ships under a non-commercial NVIDIA license, so commercial use is restricted.
What language is VoxFormer written in? NVlabs/VoxFormer is primarily written in Python.
How popular is VoxFormer? NVlabs/VoxFormer has 1.2k stars on GitHub.
Where can I find VoxFormer? NVlabs/VoxFormer is on GitHub at https://github.com/NVlabs/VoxFormer.