Teaser example. The model is first asked "What's the length of the bed in meters?" and then, after being shown the cat on the bed, "Would you change your mind after seeing this?" Gemini estimates the bed to be 1.9 meters long, citing the average dimensions of a double bed, even when presented with the contradictory visual evidence of the cat. This indicates that MLLMs rely heavily on linguistic priors for spatial reasoning, often disregarding critical visual cues.
The spatial intelligence observed on indoor benchmarks is a mirage: the significant and coherent performance gains on VSI-Bench [Yang et al. 2025] are largely a result of same-origin training data and linguistic priors in indoor scenes. These capabilities fail to transfer to open-world settings, where linguistic priors are weak.
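As a concrete illustration of this kind of prior-versus-evidence probe, here is a minimal sketch in Python; the `query_mllm` helper, the frame filename, and the answer parsing are hypothetical placeholders for illustration, not the paper's actual protocol:

```python
import re

def query_mllm(prompt, frames=None):
    """Hypothetical stand-in for whichever MLLM API is under test
    (Gemini, GPT, Qwen, ...); replace with the vendor client of your choice."""
    return "Based on the average size of a double bed, it is about 1.9 meters long."

def parse_meters(answer):
    """Extract the first length in meters from a free-form answer, if any."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:m\b|meters?)", answer.lower())
    return float(match.group(1)) if match else None

question = "What's the length of the bed in meters?"

# Blind query: no pixels, so the answer can only come from linguistic priors.
blind_answer = query_mllm(question)

# Grounded query: same question, plus a frame whose content (e.g. a cat taking up
# most of the bed) contradicts the "standard double bed" prior.
grounded_answer = query_mllm(question, frames=["bed_with_cat.jpg"])

# If the two estimates are (nearly) identical, the model is answering from its
# priors rather than from the visual evidence it was given.
print("blind:   ", parse_meters(blind_answer))
print("grounded:", parse_meters(grounded_answer))
```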
While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence—crucial for robust and grounded AI systems—remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum—from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.
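To make the automatic question generation concrete, here is a minimal sketch of how one quantitative question type could be templated from metrically accurate 3D annotations; the annotation schema, field names, and rounding are illustrative assumptions, not the benchmark's actual pipeline:

```python
import math
import random

# Assumed annotation format: object instances with a category label and a
# LiDAR-derived 3D centroid in a metric world frame (meters).
objects = [
    {"name": "fire hydrant", "center": (2.1, 0.4, 11.3)},
    {"name": "park bench",   "center": (5.7, 0.3, 18.9)},
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def make_abs_distance_question(objs, rng=random):
    """Template an 'Absolute Distance' question with verifiable metric ground truth."""
    a, b = rng.sample(objs, 2)
    question = (f"What is the distance in meters between the {a['name']} "
                f"and the {b['name']}?")
    answer = round(euclidean(a["center"], b["center"]), 1)
    return {"question": question, "answer": answer, "type": "Abs. Dis."}

print(make_abs_distance_question(objects))
```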
| Methods | Rank | Avg. | Rel. Dis. | Rel. Dir. | Qual. EM | Obj. Loc. | Abs. Dis. | Depth Count | Abs. Displ. | Abs. Speed | Quan. EM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Against Human (tiny subset) | | | | | | | | | | | |
| Human-level | 1 | 60.3 | 85.7 | 83.3 | 73.7 | 43.9 | 39.2 | 67.5 | 42.9 | 65.8 | 66.8 |
| Gemini-2.5-Pro | 2 | 36.8 | 53.1 | 23.1 | 46.7 | 39.7 | 33.8 | 40.3 | 22.2 | 27.8 | 40.0 |
| GPT-5 | 4 | 27.9 | 37.5 | 30.8 | 40.0 | 35.3 | 25.3 | 12.8 | 9.2 | 31.4 | 33.0 |
| Qwen3VL-32B-Instruct | 3 | 31.9 | 56.3 | 23.1 | 33.3 | 26.2 | 10.9 | 32.8 | 14.4 | 37.2 | 52.3 |
| Closed-source Models | |||||||||||
| Gemini-2.5-Pro | 1 | 37.2 | 50.0 | 28.1 | 52.5 | 37.4 | 28.1 | 37.9 | 26.8 | 31.1 | 40.8 |
| Gemini-2.5-Flash | 6 | 19.5 | 17.9 | 2.8 | 50.6 | 22.7 | 16.1 | 26.8 | 8.1 | 6.8 | 20.0 |
| GPT-5 | 2 | 29.7 | 34.4 | 33.1 | 49.5 | 32.5 | 23.7 | 20.9 | 10.5 | 33.8 | 30.6 |
| GPT-4o | 5 | 25.9 | 30.8 | 29.1 | 42.2 | 22.9 | 27.0 | 21.6 | 17.5 | 15.5 | 28.8 |
| Claude-3.7-Sonnet | 4 | 26.5 | 38.9 | 32.8 | 47.6 | 31.3 | 22.4 | 31.5 | 5.2 | 30.1 | 5.0 |
| Doubao-Seed-1.6V | 3 | 27.3 | 35.9 | 24.1 | 44.0 | 16.6 | 18.9 | 38.7 | 25.7 | 31.8 | 9.2 |
| Open-source Models | |||||||||||
| InternVL2-8B | 14 | 24.5 | 35.1 | 31.7 | 40.8 | 21.8 | 17.8 | 39.7 | 15.0 | 17.8 | 3.8 |
| InternVL2-40B | 16 | 22.9 | 36.7 | 21.0 | 41.9 | 21.1 | 19.2 | 33.0 | 10.0 | 22.0 | 1.7 |
| InternVL3.5-2B | 18 | 21.7 | 34.8 | 32.1 | 40.1 | 3.8 | 2.7 | 40.8 | 11.5 | 16.6 | 17.0 |
| InternVL3.5-4B | 15 | 23.7 | 37.6 | 32.8 | 44.8 | 4.6 | 6.7 | 40.7 | 15.6 | 23.4 | 10.7 |
| InternVL3.5-8B | 4 | 28.5 | 37.6 | 33.6 | 47.3 | 12.2 | 13.2 | 42.3 | 20.3 | 30.2 | 21.5 |
| InternVL3.5-14B | 4 | 28.5 | 40.3 | 33.9 | 47.1 | 15.5 | 15.6 | 42.8 | 21.8 | 32.1 | 9.0 |
| InternVL3.5-38B | 8 | 26.9 | 40.2 | 34.0 | 45.3 | 11.6 | 7.7 | 42.7 | 20.3 | 31.4 | 11.1 |
| Qwen2.5VL-32B-Instruct | 3 | 30.0 | 41.7 | 32.7 | 44.4 | 23.9 | 24.0 | 27.7 | 16.7 | 32.8 | 27.3 |
| Qwen2.5VL-72B-Instruct | 11 | 26.5 | 38.5 | 20.1 | 46.5 | 27.7 | 26.0 | 29.4 | 9.4 | 29.7 | 9.8 |
| Qwen3VL-2B-Instruct | 21 | 18.4 | 33.7 | 32.4 | 37.5 | 2.2 | 4.2 | 22.4 | 6.6 | 19.0 | 12.5 |
| Qwen3VL-4B-Instruct | 19 | 21.1 | 34.8 | 23.7 | 46.4 | 3.9 | 6.9 | 24.1 | 13.6 | 20.4 | 17.5 |
| Qwen3VL-8B-Instruct | 2 | 31.2 | 38.3 | 31.2 | 49.3 | 21.0 | 15.1 | 33.3 | 21.3 | 34.3 | 37.8 |
| Qwen3VL-32B-Instruct | 1 | 32.2 | 41.9 | 28.8 | 47.1 | 25.3 | 11.5 | 30.2 | 18.6 | 36.8 | 49.2 |
| LLaVA-OneVision-0.5B | 20 | 19.6 | 27.9 | 23.8 | 40.8 | 13.5 | 13.1 | 19.9 | 10.3 | 13.6 | 15.7 |
| LLaVA-OneVision-7B | 13 | 25.7 | 35.1 | 32.7 | 40.8 | 16.1 | 25.5 | 25.6 | 16.8 | 28.3 | 13.4 |
| LLaVA-OneVision-72B | 8 | 26.9 | 38.6 | 32.2 | 41.7 | 19.6 | 18.3 | 35.3 | 19.2 | 23.0 | 16.6 |
| LLaVA-Video-Qwen2-7B | 16 | 22.9 | 37.1 | 31.2 | 40.9 | 17.6 | 22.1 | 18.2 | 17.5 | 19.0 | 5.7 |
| LLaVA-Video-Qwen2-72B | 6 | 28.3 | 39.8 | 31.1 | 42.2 | 23.5 | 18.0 | 34.2 | 20.7 | 29.9 | 17.0 |
| Ovis2-4B | 12 | 25.9 | 34.6 | 30.3 | 40.1 | 16.1 | 18.6 | 36.8 | 17.7 | 22.6 | 18.7 |
| Ovis2-16B | 7 | 28.1 | 37.0 | 35.8 | 42.0 | 20.0 | 6.2 | 41.7 | 21.7 | 23.3 | 28.2 |
| Ovis2-34B | 10 | 26.8 | 37.3 | 35.5 | 40.8 | 19.1 | 13.1 | 37.5 | 18.7 | 27.4 | 15.5 |
Evaluation results for MLLMs. Rel. Dis., Rel. Dir., and Qual. EM are Relational sub-tasks scored by multiple-choice accuracy (MCA); Obj. Loc., Abs. Dis., and Depth Count are Static Metric sub-tasks, and Abs. Displ., Abs. Speed, and Quan. EM are Dynamic Metric sub-tasks, both scored by numerical accuracy (NA). The best and second-best results for each sub-task are highlighted in dark gray and light gray, respectively, in the original table.
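For reference, a minimal sketch of the two metric families used in the table, assuming a VSI-Bench-style mean relative accuracy for the numerical (NA) sub-tasks; the benchmark's exact tolerance schedule and answer parsing may differ:

```python
import numpy as np

def multiple_choice_accuracy(preds, gts):
    """MCA: fraction of questions where the chosen option matches the answer key."""
    return float(np.mean([p == g for p, g in zip(preds, gts)]))

def mean_relative_accuracy(preds, gts, thresholds=np.arange(0.5, 1.0, 0.05)):
    """NA: average, over confidence thresholds theta, of the fraction of
    predictions whose relative error |pred - gt| / gt is below 1 - theta
    (VSI-Bench-style mean relative accuracy, assumed here)."""
    preds, gts = np.asarray(preds, dtype=float), np.asarray(gts, dtype=float)
    rel_err = np.abs(preds - gts) / np.abs(gts)
    return float(np.mean([(rel_err < 1.0 - t).mean() for t in thresholds]))

print(multiple_choice_accuracy(["B", "C", "A"], ["B", "A", "A"]))  # 2 of 3 correct
print(mean_relative_accuracy([1.8, 4.0], [2.0, 10.0]))
```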
Our findings expose a structural gap between today's MLLMs and the level of spatial understanding required for physically grounded AI. They further suggest that scaling visual encoders or expanding training corpora alone is insufficient; genuine progress will require mechanisms capable of inferring, storing, and manipulating 3D geometric quantities in a principled manner.
If you find our work useful or interesting, please consider citing our paper:
@article{wu2025indoor,
  title   = {From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs},
  author  = {Mingrui Wu and Zhaozhi Wang and Fangjinhua Wang and Jiaolong Yang and Marc Pollefeys and Tong Zhang},
  journal = {arXiv preprint arXiv:2512.19683},
  year    = {2025}
}