Robbyant/lingbot-map

A transformer that builds 3D maps while you walk, no do-overs

LingBot-Map reconstructs scenes from streaming video in one forward pass, handling 10,000+ frames without iterative optimization.

7.1k stars · Python · Computer Vision
Velocity (7d): +132 ★ / day · Trend: steady

What it does

LingBot-Map takes a stream of RGB frames and reconstructs camera poses and dense 3D geometry on the fly. It runs feed-forward at roughly 20 FPS on 518×378 video, using a paged KV cache to keep memory bounded across long sequences. The interactive demo spins up a browser viewer at localhost:8080; an offline pipeline handles sequences too large for live visualization, like a 25,000-frame indoor walkthrough.
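To make the "bounded memory" idea concrete, here is a toy paged cache in plain Python. It is only a sketch: the `PagedCache` class, the page size, the pool size, and the oldest-page eviction policy are all assumptions made for illustration, not LingBot-Map's actual cache or the FlashInfer API.

```python
from collections import deque


class PagedCache:
    """Toy paged cache: fixed-size pages drawn from a bounded pool.

    Illustration only. The page size, pool size, and oldest-page eviction
    policy are assumptions, not LingBot-Map's actual behavior.
    """

    def __init__(self, page_size: int, max_pages: int) -> None:
        if page_size < 1 or max_pages < 1:
            raise ValueError("page_size and max_pages must be positive")
        self.page_size = page_size
        self.max_pages = max_pages
        self.pages = deque()  # each page is a list of at most page_size entries

    def append(self, entry) -> None:
        # Open a new page when there is none yet or the newest page is full.
        if not self.pages or len(self.pages[-1]) == self.page_size:
            if len(self.pages) == self.max_pages:
                self.pages.popleft()  # evict the oldest page (assumed policy)
            self.pages.append([])
        self.pages[-1].append(entry)

    def __len__(self) -> int:
        return sum(len(page) for page in self.pages)


cache = PagedCache(page_size=4, max_pages=3)
for frame_idx in range(10_000):
    cache.append(frame_idx)

print(len(cache))  # 12
```

However many frames stream through, the cache never holds more than `page_size * max_pages` entries, which is the property that lets a long sequence run without memory growing with its length.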

The interesting bit

The architecture treats 3D reconstruction as a streaming attention problem. It anchors geometric context in world coordinates, maintains a pose-reference window, and corrects long-range drift through trajectory memory, all within a single transformer pass. No bundle adjustment loops, no state resets by default. The authors also fixed a FlashInfer KV cache bug in late April where non-keyframes were silently cached when using --keyframe_interval > 1.
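The keyframe bug is easier to picture with a small example. The sketch below shows the behavior one would expect once only keyframes enter the cache. It assumes, hypothetically, that frame `i` counts as a keyframe when `i % keyframe_interval == 0`; the repository may pick keyframes differently, and `frames_to_cache` is an illustrative name, not a function from the codebase.

```python
def frames_to_cache(num_frames: int, keyframe_interval: int) -> list[int]:
    """Return indices of frames whose keys/values should enter the cache.

    Assumes (hypothetically) that frame i is a keyframe when
    i % keyframe_interval == 0; the real selection rule may differ.
    """
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be >= 1")
    return [i for i in range(num_frames) if i % keyframe_interval == 0]


print(frames_to_cache(10, 3))  # [0, 3, 6, 9]
```

Under this assumption, a cache that also stored frames 1, 2, 4, 5, and so on would grow about three times faster than intended, which is the kind of silent cost the fix removes.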

Key highlights

  • Three model variants on HuggingFace and ModelScope: lingbot-map-long for long sequences (recommended), a balanced general checkpoint, and a stage-1 checkpoint compatible with VGGT bidirectional inference.
  • Windowed inference mode with configurable overlap for sequences beyond 3,000 frames; keyframe subsampling to stretch past the 320-view training limit.
  • Sky masking via an ONNX segmentation model, with automatic caching of masks for reuse.
  • Evaluation scripts released for KITTI and Oxford Spires; benchmarks for TUM-D, 7-Scenes, ETH3D, Tanks and Temples, and NRGBD are still pending.
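The windowed mode mentioned in the highlights can be pictured as slicing a long sequence into overlapping chunks. The following is a minimal sketch of that slicing only; `window_ranges`, its parameters, and the values used are hypothetical and do not reflect the repository's actual flags or defaults.

```python
def window_ranges(num_frames: int, window_size: int, overlap: int) -> list[tuple[int, int]]:
    """Split [0, num_frames) into half-open windows that share `overlap` frames."""
    if window_size < 1:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    stride = window_size - overlap  # how far each new window advances
    ranges = []
    start = 0
    while start < num_frames:
        end = min(start + window_size, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            break  # the last window already reaches the end of the sequence
        start += stride
    return ranges


print(window_ranges(10, 4, 1))  # [(0, 4), (3, 7), (6, 10)]
```

Each window repeats the last `overlap` frames of the previous one, which gives consecutive windows shared frames to line up against. How LingBot-Map actually stitches windows together is not covered here.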

Caveats

  • Requires PyTorch 2.8.0 with CUDA 12.8 if you want the offline batch renderer without building NVIDIA Kaolin from source; demo.py alone is more flexible.
  • Inference range is bounded by training coverage — wander too far and poses may collapse unless you switch to windowed mode.
  • Outdoor long-video, aerial, and “LingBot-World” demos are listed as TODO but not yet released.

Verdict

Worth a look if you’re building real-time SLAM, robotics perception, or video-to-3D pipelines and want to skip the optimization loop. Skip it if you need guaranteed metric accuracy on out-of-distribution trajectories or a fully baked evaluation suite today.
