THU-SI/Spatial-MLLM
Spatial-MLLM enhances existing video multimodal LLMs with visual-based spatial intelligence capabilities.

Velocity (7d): +1.2 ★/day · Trend: steady
Spatial-MLLM is a method for improving the visual-based spatial intelligence of existing video multimodal large language models. The project provides supervised fine-tuning code, evaluation code, and pre-trained models for spatial reasoning tasks. It reports state-of-the-art performance on benchmarks such as VSI-Bench and releases models trained on datasets such as Spatial-MLLM-120k.