From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs

1University of Chinese Academy of Sciences, UCAS
2ETH Zürich
3Microsoft Research Asia
* Corresponding Author
GitHub | arXiv | 🤗 HuggingFace
A quick test for your spatial intelligence!

What's the length of the bed in meters?

(Image: bed without reference)

Would you change your mind after seeing this?

(Image: bed with cat)
This is a 1:3 miniature of a real bedroom; the actual length of the bed is 0.75 meters.

Gemini estimates the bed to be 1.9 meters long, citing the average dimensions of a double bed, even after being shown the cat, a familiar object of roughly known size whose scale contradicts that estimate. This indicates that MLLMs rely heavily on linguistic priors for spatial reasoning, often disregarding critical visual cues.

Spatial Intelligence Mirage Analysis

The spatial intelligence observed on indoor benchmarks is a mirage: the significant and coherent performance gains on VSI-Bench [Yang et al. 2025] are largely a result of same-origin training data and linguistic priors in indoor scenes. These capabilities fail to transfer to open-world settings, where linguistic priors are weak.
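One simple way to expose this reliance on priors is a blinding test: ask the same metric question with and without the visual input and compare the answers. The sketch below only illustrates the idea; the `query_mllm` wrapper, the question format, and the gap statistic are assumptions, not the benchmark's actual evaluation code.

```python
# Minimal sketch of a "blinding" probe: if a model's metric estimates barely
# change when the video frames are withheld, its answers are likely driven by
# linguistic priors rather than grounded visual evidence.
# `query_mllm(text, frames)` is a hypothetical wrapper around an MLLM API
# that returns a single numeric answer.

from statistics import median

def blinding_gap(questions, query_mllm):
    """Median relative change in numeric answers when vision is removed.

    questions: list of dicts such as
        {"text": "What is the length of the bed in meters?",
         "frames": [...]}          # video frames or image paths
    """
    gaps = []
    for q in questions:
        with_vision = query_mllm(q["text"], frames=q["frames"])
        text_only = query_mllm(q["text"], frames=None)   # blinded query
        denom = max(abs(with_vision), 1e-6)              # avoid divide-by-zero
        gaps.append(abs(with_vision - text_only) / denom)
    return median(gaps)
```

A median gap near zero indicates that the visual stream has little effect on the model's metric answers, i.e. the answers are coming from priors.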

Abstract

Project Teaser

While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence—crucial for robust and grounded AI systems—remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum—from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.

Benchmark Construction
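The questions are generated automatically from metrically precise 3D ground truth. As a rough illustration only, the sketch below templates a quantitative distance question from LiDAR-aligned 3D object annotations; the data structures, labels, and question template are assumptions, not the actual construction pipeline.

```python
import numpy as np

def make_absolute_distance_question(objects, id_a, id_b):
    """Template a metric QA pair from 3D annotations.

    objects: dict mapping object id -> {"label": str, "center": (x, y, z)},
             with centers expressed in a metric world frame
             (e.g. recovered from LiDAR plus IMU/GPS).
    Returns a (question, ground_truth_in_meters) pair.
    """
    a, b = objects[id_a], objects[id_b]
    dist = float(np.linalg.norm(np.asarray(a["center"]) - np.asarray(b["center"])))
    question = (f"What is the straight-line distance in meters between "
                f"the {a['label']} and the {b['label']}?")
    return question, round(dist, 2)

# Example with made-up annotations: the two centers are 5 m apart.
objects = {
    "obj_1": {"label": "fire hydrant", "center": (2.0, 0.0, 0.5)},
    "obj_2": {"label": "parked bicycle", "center": (6.0, 3.0, 0.5)},
}
print(make_absolute_distance_question(objects, "obj_1", "obj_2"))
```

Relational and kinematic questions can be templated analogously from object poses and the ego trajectory, which is what makes the hierarchy of tasks scalable.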

Tasks Overview

Evaluation on OpenBench

Sub-task columns are grouped as Relational (MCA): Rel. Dis., Rel. Dir., Qual. EM, Obj. Loc.; Static Metric (NA): Abs. Dis., Depth Count; Dynamic Metric (NA): Abs. Displ., Abs. Speed, Quan. EM.

Against Human on tiny

| Methods | Rank | Avg. | Rel. Dis. | Rel. Dir. | Qual. EM | Obj. Loc. | Abs. Dis. | Depth Count | Abs. Displ. | Abs. Speed | Quan. EM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human-level | 1 | 60.3 | 85.7 | 83.3 | 73.7 | 43.9 | 39.2 | 67.5 | 42.9 | 65.8 | 66.8 |
| Gemini-2.5-Pro | 2 | 36.8 | 53.1 | 23.1 | 46.7 | 39.7 | 33.8 | 40.3 | 22.2 | 27.8 | 40.0 |
| GPT-5 | 4 | 27.9 | 37.5 | 30.8 | 40.0 | 35.3 | 25.3 | 12.8 | 9.2 | 31.4 | 33.0 |
| Qwen3VL-32B-Instruct | 3 | 31.9 | 56.3 | 23.1 | 33.3 | 26.2 | 10.9 | 32.8 | 14.4 | 37.2 | 52.3 |

Closed-source Models

| Methods | Rank | Avg. | Rel. Dis. | Rel. Dir. | Qual. EM | Obj. Loc. | Abs. Dis. | Depth Count | Abs. Displ. | Abs. Speed | Quan. EM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini-2.5-Pro | 1 | 37.2 | 50.0 | 28.1 | 52.5 | 37.4 | 28.1 | 37.9 | 26.8 | 31.1 | 40.8 |
| Gemini-2.5-Flash | 6 | 19.5 | 17.9 | 2.8 | 50.6 | 22.7 | 16.1 | 26.8 | 8.1 | 6.8 | 20.0 |
| GPT-5 | 2 | 29.7 | 34.4 | 33.1 | 49.5 | 32.5 | 23.7 | 20.9 | 10.5 | 33.8 | 30.6 |
| GPT-4o | 5 | 25.9 | 30.8 | 29.1 | 42.2 | 22.9 | 27.0 | 21.6 | 17.5 | 15.5 | 28.8 |
| Claude-3.7-Sonnet | 4 | 26.5 | 38.9 | 32.8 | 47.6 | 31.3 | 22.4 | 31.5 | 5.2 | 30.1 | 5.0 |
| Doubao-Seed-1.6V | 3 | 27.3 | 35.9 | 24.1 | 44.0 | 16.6 | 18.9 | 38.7 | 25.7 | 31.8 | 9.2 |

Open-source Models

| Methods | Rank | Avg. | Rel. Dis. | Rel. Dir. | Qual. EM | Obj. Loc. | Abs. Dis. | Depth Count | Abs. Displ. | Abs. Speed | Quan. EM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL2-8B | 14 | 24.5 | 35.1 | 31.7 | 40.8 | 21.8 | 17.8 | 39.7 | 15.0 | 17.8 | 3.8 |
| InternVL2-40B | 16 | 22.9 | 36.7 | 21.0 | 41.9 | 21.1 | 19.2 | 33.0 | 10.0 | 22.0 | 1.7 |
| InternVL3.5-2B | 18 | 21.7 | 34.8 | 32.1 | 40.1 | 3.8 | 2.7 | 40.8 | 11.5 | 16.6 | 17.0 |
| InternVL3.5-4B | 15 | 23.7 | 37.6 | 32.8 | 44.8 | 4.6 | 6.7 | 40.7 | 15.6 | 23.4 | 10.7 |
| InternVL3.5-8B | 4 | 28.5 | 37.6 | 33.6 | 47.3 | 12.2 | 13.2 | 42.3 | 20.3 | 30.2 | 21.5 |
| InternVL3.5-14B | 4 | 28.5 | 40.3 | 33.9 | 47.1 | 15.5 | 15.6 | 42.8 | 21.8 | 32.1 | 9.0 |
| InternVL3.5-38B | 8 | 26.9 | 40.2 | 34.0 | 45.3 | 11.6 | 7.7 | 42.7 | 20.3 | 31.4 | 11.1 |
| Qwen2.5VL-32B-Instruct | 3 | 30.0 | 41.7 | 32.7 | 44.4 | 23.9 | 24.0 | 27.7 | 16.7 | 32.8 | 27.3 |
| Qwen2.5VL-72B-Instruct | 11 | 26.5 | 38.5 | 20.1 | 46.5 | 27.7 | 26.0 | 29.4 | 9.4 | 29.7 | 9.8 |
| Qwen3VL-2B-Instruct | 21 | 18.4 | 33.7 | 32.4 | 37.5 | 2.2 | 4.2 | 22.4 | 6.6 | 19.0 | 12.5 |
| Qwen3VL-4B-Instruct | 19 | 21.1 | 34.8 | 23.7 | 46.4 | 3.9 | 6.9 | 24.1 | 13.6 | 20.4 | 17.5 |
| Qwen3VL-8B-Instruct | 2 | 31.2 | 38.3 | 31.2 | 49.3 | 21.0 | 15.1 | 33.3 | 21.3 | 34.3 | 37.8 |
| Qwen3VL-32B-Instruct | 1 | 32.2 | 41.9 | 28.8 | 47.1 | 25.3 | 11.5 | 30.2 | 18.6 | 36.8 | 49.2 |
| LLaVA-OneVision-0.5B | 20 | 19.6 | 27.9 | 23.8 | 40.8 | 13.5 | 13.1 | 19.9 | 10.3 | 13.6 | 15.7 |
| LLaVA-OneVision-7B | 13 | 25.7 | 35.1 | 32.7 | 40.8 | 16.1 | 25.5 | 25.6 | 16.8 | 28.3 | 13.4 |
| LLaVA-OneVision-72B | 8 | 26.9 | 38.6 | 32.2 | 41.7 | 19.6 | 18.3 | 35.3 | 19.2 | 23.0 | 16.6 |
| LLaVA-Video-Qwen2-7B | 16 | 22.9 | 37.1 | 31.2 | 40.9 | 17.6 | 22.1 | 18.2 | 17.5 | 19.0 | 5.7 |
| LLaVA-Video-Qwen2-72B | 6 | 28.3 | 39.8 | 31.1 | 42.2 | 23.5 | 18.0 | 34.2 | 20.7 | 29.9 | 17.0 |
| Ovis2-4B | 12 | 25.9 | 34.6 | 30.3 | 40.1 | 16.1 | 18.6 | 36.8 | 17.7 | 22.6 | 18.7 |
| Ovis2-16B | 7 | 28.1 | 37.0 | 35.8 | 42.0 | 20.0 | 6.2 | 41.7 | 21.7 | 23.3 | 28.2 |
| Ovis2-34B | 10 | 26.8 | 37.3 | 35.5 | 40.8 | 19.1 | 13.1 | 37.5 | 18.7 | 27.4 | 15.5 |

Evaluation results for MLLMs on OpenBench. In the original rendered table, the best and second-best results for each sub-task are highlighted in deeper gray and light gray.
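The (MCA) and (NA) tags distinguish sub-tasks answered by choosing an option from those answered with a number. The exact numerical-answer metric is defined in the paper; the sketch below assumes a VSI-Bench-style mean relative accuracy purely to illustrate how numeric predictions can be scored against metric ground truth.

```python
def mca_score(pred_choice, gt_choice):
    """Multiple-choice sub-tasks: exact match on the selected option."""
    return float(pred_choice.strip().upper() == gt_choice.strip().upper())

def na_score(pred, gt, thresholds=(0.50, 0.55, 0.60, 0.65, 0.70,
                                   0.75, 0.80, 0.85, 0.90, 0.95)):
    """Assumed VSI-Bench-style mean relative accuracy for numeric answers:
    a prediction counts as correct at confidence level c when its relative
    accuracy 1 - |pred - gt| / |gt| reaches c; the score averages over c."""
    if gt == 0:
        return 0.0
    rel_acc = 1.0 - abs(pred - gt) / abs(gt)
    return sum(rel_acc >= c for c in thresholds) / len(thresholds)

# Predicting 1.9 m for the 0.75 m bed from the teaser scores 0.0,
# while 0.7 m scores 0.9.
print(na_score(1.9, 0.75), na_score(0.7, 0.75))
```

Under such a metric, prior-driven over-estimates like the 1.9 m bed receive no credit.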

Conclusion

Our findings expose a structural gap between today's MLLMs and the level of spatial understanding required for physically grounded AI. They further suggest that scaling visual encoders or expanding training corpora alone is insufficient; genuine progress will require mechanisms capable of inferring, storing, and manipulating 3D geometric quantities in a principled manner.

Citation

If you find our work useful or interesting, please consider citing our paper:

@article{wu2025indoor,
  title={From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs},
  author={Mingrui Wu and Zhaozhi Wang and Fangjinhua Wang and Jiaolong Yang and Marc Pollefeys and Tong Zhang},
  journal={arXiv preprint arXiv:2512.19683},
  year={2025}}