Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

Code and datasets for Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas (ICML 2025) [paper].

This codebase builds on the code for What's "up" with vision-language models? Investigating their struggle with spatial reasoning [paper][code].

Datasets

The code to load and evaluate each dataset is in dataset_zoo/aro_datasets.py. The question-answering data is in prompt/.

Methods: ScalingVis and AdaptVis

Setting up the environment

git clone https://github.com/shiqichen17/AdaptVis.git
cd AdaptVis
mkdir data
mkdir output
pip install -r requirements.txt

Downloading the data

The data all lives in whatsup_vlms/data, which is also where your models will go as they're downloaded.

For all the datasets, setting --download=True (while running python main_aro.py or while instantiating the dataset directly, as mentioned later in this README) will download the data JSONs and images if the files don't already exist.

You can also download the data directly from this Google Drive link. Alternatively, you can download from HuggingFace datasets here.
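As an illustrative sketch, a first run that downloads the data automatically might look like the command below; --download=True is taken from the note above, while the other flag names are assumptions mirroring the run.sh arguments listed later in this README:

# Hypothetical direct invocation; flag names other than --download are assumed
python main_aro.py --dataset Controlled_Images_A --model llava1.5 --method scaling_vis --weight 1.2 --download=True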

Running experiments: scaling_vis and adapt_vis

You can quickly run an example with:

bash run.sh

Arguments

All parameter choices are indicated in run.sh.

Argument | Example | Description
--- | --- | ---
dataset | Controlled_Images_A | The dataset to evaluate. Choose from Controlled_Images_A, Controlled_Images_B...
model | llava1.5 | The model to use.
method | scaling_vis | The evaluation method. Choose from scaling_vis or adapt_vis.
weight | 1.2 | Coefficient for ScalingVis. Choose from [0, 0.5, 0.8, 1.2, 1.5, 2.0].
weight1 | 0.5 | First coefficient for AdaptVis. Choose from [0.5, 0.8].
weight2 | 1.2 | Second coefficient for AdaptVis. Choose from [1.2, 1.5, 2.0].
threshold | 0.3 | Threshold for AdaptVis.
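
For reference, here is a minimal sketch of passing these arguments to main_aro.py directly; the flag names simply mirror the table above and run.sh, and the actual script may wire them differently, so treat this as illustrative rather than authoritative:

# ScalingVis: a single attention-scaling coefficient (assumed flag names)
python main_aro.py --dataset Controlled_Images_A --model llava1.5 --method scaling_vis --weight 1.2

# AdaptVis: two coefficients selected by a threshold (assumed flag names)
python main_aro.py --dataset Controlled_Images_A --model llava1.5 --method adapt_vis --weight1 0.5 --weight2 1.2 --threshold 0.3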

Citation

If you use this code or data, please consider citing our paper:

@misc{chen2025spatialreasoninghardvlms,
      title={Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas}, 
      author={Shiqi Chen and Tongyao Zhu and Ruochen Zhou and Jinghan Zhang and Siyang Gao and Juan Carlos Niebles and Mor Geva and Junxian He and Jiajun Wu and Manling Li},
      year={2025},
      eprint={2503.01773},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.01773}, 
}
