A transformer that builds 3D maps while you walk, no do-overs
LingBot-Map reconstructs scenes from streaming video in one forward pass, handling 10,000+ frames without iterative optimization.

What it does
LingBot-Map takes a stream of RGB frames and reconstructs camera poses and dense 3D geometry on the fly. It runs feed-forward at roughly 20 FPS on 518×378 video, using a paged KV cache to keep memory bounded across long sequences. The interactive demo spins up a browser viewer at localhost:8080; an offline pipeline handles sequences too large for live visualization, like a 25,000-frame indoor walkthrough.
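To make the bounded-memory idea concrete, here is a minimal toy sketch of a paged cache in plain Python. It is not LingBot-Map's implementation: the class name `ToyPagedKVCache`, the page size, and the oldest-page-first eviction policy are all assumptions chosen for illustration. The point is only that storing entries in fixed-size pages under a page budget keeps memory flat no matter how many frames stream through.

```python
from collections import deque


class ToyPagedKVCache:
    """Toy paged cache (hypothetical, for illustration only).

    Entries are stored in fixed-size pages. Once the page budget is
    exceeded, the oldest page is dropped, so memory stays bounded.
    """

    def __init__(self, page_size: int, max_pages: int):
        self.page_size = page_size
        self.max_pages = max_pages
        self.pages = deque()  # each page is a list of (key, value) pairs

    def append(self, key, value):
        # Open a new page when there is none yet or the last one is full.
        if not self.pages or len(self.pages[-1]) == self.page_size:
            self.pages.append([])
            # Assumed eviction policy: drop the oldest page first.
            if len(self.pages) > self.max_pages:
                self.pages.popleft()
        self.pages[-1].append((key, value))

    def __len__(self):
        return sum(len(page) for page in self.pages)


cache = ToyPagedKVCache(page_size=4, max_pages=3)
for frame_idx in range(10_000):
    cache.append(f"k{frame_idx}", f"v{frame_idx}")

# The cache never holds more than page_size * max_pages entries.
print(len(cache))
```

A real system would likely pin some pages (for example, anchor or reference frames) rather than evicting strictly by age; that refinement is left out here.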
The interesting bit
The architecture treats 3D reconstruction as a streaming attention problem. It anchors geometric context in world coordinates, maintains a pose-reference window, and corrects long-range drift through trajectory memory — all within a single transformer pass. No bundle adjustment loops, no state resets by default. The authors also fixed a FlashInfer KV cache bug in late April where non-keyframes were silently cached when using --keyframe_interval > 1.
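The keyframe bug is easier to picture with a small sketch of the intended behavior: every frame is processed, but only keyframes should persist in the cache. The helper names below (`is_keyframe`, `frames_to_cache`) are hypothetical, and placing keyframes at multiples of the interval is an assumption, not something the project documents here.

```python
def is_keyframe(frame_idx: int, keyframe_interval: int) -> bool:
    # Assumption: keyframes fall on multiples of the interval.
    return frame_idx % keyframe_interval == 0


def frames_to_cache(num_frames: int, keyframe_interval: int) -> list[int]:
    """Return the frame indices that should be written to the cache."""
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be >= 1")
    cached = []
    for frame_idx in range(num_frames):
        # Every frame would still get a pose estimate here; only
        # keyframes are kept as long-term context.
        if is_keyframe(frame_idx, keyframe_interval):
            cached.append(frame_idx)
    return cached


print(frames_to_cache(10, 3))
```

With an interval of 1 every frame is cached, which is why a bug of this kind would only surface at intervals above 1.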
Key highlights
- Three model variants on HuggingFace and ModelScope: lingbot-map-long for long sequences (recommended), a balanced general checkpoint, and a stage-1 checkpoint compatible with VGGT bidirectional inference.
- Windowed inference mode with configurable overlap for sequences beyond 3,000 frames; keyframe subsampling to stretch past the 320-view training limit.
- Sky masking via an ONNX segmentation model, with automatic caching of masks for reuse.
- Evaluation scripts released for KITTI and Oxford Spires; benchmarks for TUM-D, 7-scenes, ETH3D, Tanks and Temples, and NRGBD are still pending.
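Windowed inference with overlap comes down to index bookkeeping. The sketch below shows one plausible way to split a long sequence into overlapping windows; the function name `make_windows` and the window and overlap sizes are illustrative assumptions, not the project's actual flags or defaults.

```python
def make_windows(num_frames: int, window_size: int, overlap: int) -> list[tuple[int, int]]:
    """Split [0, num_frames) into half-open windows sharing `overlap` frames."""
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= overlap < window_size")
    if num_frames <= 0:
        return []
    stride = window_size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window_size, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += stride
    return windows


# Hypothetical settings for a 25,000-frame walkthrough.
windows = make_windows(num_frames=25_000, window_size=3_000, overlap=300)
print(len(windows), windows[0], windows[-1])
```

The shared frames between consecutive windows are what would let neighboring reconstructions be aligned to each other; how that alignment is done is outside this sketch.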
Caveats
- Requires PyTorch 2.8.0 with CUDA 12.8 if you want the offline batch renderer without building NVIDIA Kaolin from source; demo.py alone is more flexible.
- Inference range is bounded by training coverage: wander too far and poses may collapse unless you switch to windowed mode.
- Outdoor long-video, aerial, and “LingBot-World” demos are listed as TODO but not yet released.
Verdict
Worth a look if you’re building real-time SLAM, robotics perception, or video-to-3D pipelines and want to skip the optimization loop. Skip it if you need guaranteed metric accuracy on out-of-distribution trajectories or a fully baked evaluation suite today.