← all repositories
manycore-research/SpatialLM

A 0.5B-parameter model that reads messy room scans like a floor plan

SpatialLM turns noisy point clouds from any source—phone video, RGBD, LiDAR—into structured 3D layouts with walls, doors, and furniture boxes.

4.6k stars Python Language ModelsComputer Vision
SpatialLM
Velocity · 7d
+10
★ / day
Trend
steady
star history

What it does

SpatialLM ingests 3D point clouds and outputs structured indoor geometry: walls, doors, windows, and oriented bounding boxes for objects with semantic labels. It accepts data from monocular video, RGBD images, or LiDAR—no specialized capture rig required. The repo includes inference scripts, evaluation tools, and fine-tuning instructions via FINETUNE.md.

The interesting bit

The twist is treating this as a language modeling problem. SpatialLM1.1 uses an LLM backbone (Llama-1B or Qwen-0.5B) and a point encoder called Sonata, letting users prompt with specific furniture categories—“bed nightstand”—and get filtered detections back. The model was trained on a synthetic dataset of 12,328 scenes (54,778 rooms) designed by professional 3D artists, then fine-tuned on real benchmarks like Structured3D.

Key highlights

  • Four model sizes on Hugging Face, down to 0.5B parameters
  • Three task modes: full structured reconstruction, layout-only (walls/doors/windows), or object detection
  • 59 furniture categories available for user-specified filtering in v1.1
  • Test set of 107 point clouds reconstructed from monocular RGB video via MASt3R-SLAM, intentionally noisy to stress-test robustness
  • Benchmarks include corrected metrics after a community bugfix in the eval script

Caveats

  • Input point clouds must be axis-aligned with z-up; the README calls this “crucial for consistency”
  • Installation involves building wheels for torchsparse or flash-attn, which “will take a while”
  • The README notes v1.1 doubles point cloud resolution over v1.0, but doesn’t quantify the memory or speed tradeoff

Verdict

Worth a look if you’re doing embodied robotics, indoor navigation, or any pipeline that needs to go from raw 3D capture to structured scene graphs. Skip it if you need outdoor scenes or real-time on-edge inference—the smallest model is 0.5B parameters and the setup assumes CUDA 12.4.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.