Code and datasets for *Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas* [paper].
This code builds on the code of *What's "up" with vision-language models? Investigating their struggle with spatial reasoning* [paper][code].
The code to load and evaluate each dataset is in `dataset_zoo/aro_datasets.py`. The question-answering data is in `prompt/`.
```shell
git clone https://github.com/shiqichen17/AdaptVis.git
cd AdaptVis
mkdir data
mkdir output
pip install -r requirements.txt
```
The data all lives in whatsup_vlms/data, which is also where your models will go as they're downloaded.
For all the datasets, setting `--download=True` (when running `python main_aro.py`, or when instantiating the dataset directly as described later in this README) will download the data JSONs and images if the files don't already exist.
You can also download the data directly from this Google Drive link. Alternatively, you can download from HuggingFace datasets here.
You can quickly run an example with:
```shell
bash run.sh
```
All parameter choices are indicated in run.sh.
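`run.sh` presumably wraps an invocation along these lines (the flag names follow the argument table below; the exact entry point and flag spelling may differ, so treat this as a sketch rather than the definitive command):

```shell
# Hypothetical invocation -- check run.sh for the exact flags used.
python main_aro.py --dataset Controlled_Images_A --model llava1.5 \
    --method scaling_vis --weight 1.2 --download True
```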
| Argument | Example | Description |
|---|---|---|
| `dataset` | `Controlled_Images_A` | Specifies the dataset you want to evaluate. Can choose from `Controlled_Images_A`, `Controlled_Images_B`... |
| `model` | `llava1.5` | Specifies the model you want to use. |
| `method` | `scaling_vis` | The method for evaluation. Can choose from `scaling_vis` or `adapt_vis`. |
| `weight` | `1.2` | Coefficient for ScalingVis. Can be set from [0, 0.5, 0.8, 1.2, 1.5, 2.0]. |
| `weight1` | `0.5` | Coefficient for AdaptVis. Can be set from [0.5, 0.8]. |
| `weight2` | `1.2` | Coefficient for AdaptVis. Can be set from [1.2, 1.5, 2.0]. |
| `threshold` | `0.3` | Confidence threshold for AdaptVis. |
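Conceptually, AdaptVis switches between the two coefficients based on the model's confidence relative to `threshold`. A minimal sketch of that selection logic (the function name and the exact confidence-to-weight mapping are assumptions for illustration; see the paper and `run.sh` for the actual rule):

```python
def choose_scaling_weight(confidence: float,
                          threshold: float = 0.3,
                          weight1: float = 0.5,
                          weight2: float = 1.2) -> float:
    """Pick the attention-scaling coefficient for the current sample.

    Assumption (not verified against the repo): below-threshold confidence
    selects the smoothing coefficient weight1 (< 1), and at-or-above-threshold
    confidence selects the sharpening coefficient weight2 (> 1).
    """
    return weight1 if confidence < threshold else weight2
```

The coefficient returned here would then play the same role as `--weight` in the `scaling_vis` method, but chosen per sample instead of fixed globally.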
If you use this code or data, please consider citing our paper:
```bibtex
@misc{chen2025spatialreasoninghardvlms,
      title={Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas},
      author={Shiqi Chen and Tongyao Zhu and Ruochen Zhou and Jinghan Zhang and Siyang Gao and Juan Carlos Niebles and Mor Geva and Junxian He and Jiajun Wu and Manling Li},
      year={2025},
      eprint={2503.01773},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.01773},
}
```
