1 HKUST  2 CMU  3 MIT
*: Equal contribution
VLM2-Bench is the first comprehensive benchmark that evaluates vision-language models' (VLMs) ability to visually link matching cues across multi-image sequences and videos. The benchmark consists of 9 subtasks with over 3,000 test cases, designed to assess fundamental visual linking capabilities that humans use daily, such as identifying the same person across different photos without prior knowledge of their identity. Through extensive evaluation of eight open-source VLMs and GPT-4o using various prompting techniques, we uncover significant challenges in visual cue linking abilities, with even the best model (GPT-4o) performing 34.80% below human level. Our analysis reveals the need for 1) stronger core visual capabilities with less reliance on prior knowledge, 2) better integration of language reasoning in visual tasks, and 3) improved training approaches for independent visual relationship inference.
- 2025/05/15: VLM2-Bench is accepted to the ACL 2025 Main Conference!
- 2025/03/12: We have integrated all 2,860 multi-image cases of VLM2-Bench into VLMEvalKit (example usage here). In the meantime, feel free to follow our repo for local deployment.
- 2025/02/24: We submitted our paper to HF daily papers (Feb 24); your upvotes are greatly appreciated!
- 2025/02/18: The preprint of VLM2-Bench is now officially released!
VLM2-Bench is designed to evaluate models' ability to visually link matching cues across multiple images and videos. It is organized into three main categories:
- General Cue (GC): Assessing matching and tracking of visual elements.
- Object-centric Cue (OC): Evaluating comparison, counting, and grouping of objects.
- Person-centric Cue (PC): Focusing on comparing, counting, and grouping individuals, as well as describing a person's identity across a video.
The dataset comprises a total of 3060 question-answer pairs generated via a semi-automated pipeline with human verification, covering various question formats such as True/False, multiple-choice, numerical, and open-ended queries.
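As a minimal sketch, the snippet below loads one of the question files to inspect its contents. The path follows the dataset layout shown in the Setup section below; no field names are assumed — it simply parses each JSON-Lines record into a dict and prints its keys.

```python
# Quick look at a VLM2-Bench question file (JSON Lines format).
# No schema is assumed here; we just load each record and inspect its keys.
import json

with open("jsonl/gc/vanilla/gc_mat.jsonl", "r", encoding="utf-8") as f:
    questions = [json.loads(line) for line in f if line.strip()]

print(f"{len(questions)} questions in gc_mat")
print("fields:", sorted(questions[0].keys()))  # inspect the schema of one item
```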
- Git clone VLM2-Bench:

git clone https://github.com/vlm2-bench/VLM2-Bench.git
cd VLM2-Bench

- Create a conda environment with Python 3.9 and install the dependencies:

conda create -n vlm2bench python=3.9
conda activate vlm2bench
pip install "openai>=1"
pip install -r requirements.txt

For model inference, our benchmark does not require any specific packages. We recommend using the official inference scripts provided by model developers. For example, to test Qwen2.5-VL-7B-Instruct, you can follow the installation and inference instructions for Qwen2.5-VL-7B-Instruct.
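For orientation, here is a minimal multi-image inference sketch that mirrors the usage shown on the official Qwen2.5-VL-7B-Instruct model card. It is not part of the benchmark code: the image paths and question text are placeholders, and the exact class names may depend on your installed transformers and qwen-vl-utils versions.

```python
# Minimal multi-image inference sketch for Qwen2.5-VL-7B-Instruct,
# following the official model-card example. Requires a recent
# `transformers` and the `qwen-vl-utils` helper package.
# Paths and the question text below are placeholders, not actual benchmark items.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "data/gc/processed/example_1.jpg"},   # placeholder
        {"type": "image", "image": "data/gc/processed/example_2.jpg"},   # placeholder
        {"type": "text", "text": "Do these two images show the same object? "
                                 "Answer True or False."},
    ],
}]

# Build the chat prompt and pack the visual inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens before decoding.
output_ids = model.generate(**inputs, max_new_tokens=64)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```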
- Download the VLM2-Bench dataset from our Hugging Face repository link and unzip it at the root directory of this repository:

unzip vlm2-bench_dataset.zip

After unzipping, you will see the following structure:
vlm2-bench/
├── code
│   ├── gc
│   ├── oc
│   └── pc
├── data (images and videos)
│   ├── gc
│   ├── oc
│   └── pc
└── jsonl (question files)
    ├── gc
    │   └── vanilla
    │       ├── gc_mat.jsonl
    │       └── gc_trk.jsonl
    ├── oc
    └── pc

- We provide example inference code for Qwen2.5-VL-7B under each task's test_script_example directory, for example: code/gc/test/test_script_example/test_qwen2p5_7B_img_qa_gc.py.
Example usage for a single model on the gc_mat task:

python code/gc/test/test_script_example/test_qwen2p5_7B_img_qa_gc.py \
    --question_file "jsonl/gc/vanilla/gc_mat.jsonl" \
    --image_folder "data/gc/processed" \
    --output_dir "code/gc/test/test_res/test_mat"

- Additionally, the test directory of each task contains a complete bash script for sequentially testing multiple models, for example: code/gc/test/run_gc_full_round.bash.
Example command:

bash code/gc/test/run_gc_full_round.bash

This script runs the model on the gc_mat and gc_trk tasks and saves the results in the code/gc/test/test_res directory.
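For reference, the sketch below illustrates in Python what such a full-round driver does: running the example inference script over each GC question file in sequence. It is not the contents of run_gc_full_round.bash; the output-directory naming is an assumption extrapolated from the single-task example above.

```python
# Illustrative sketch of a "full round" driver: run the example inference
# script over every GC question file in sequence. The actual
# run_gc_full_round.bash may differ in models, paths, and arguments.
import subprocess
from pathlib import Path

SCRIPT = "code/gc/test/test_script_example/test_qwen2p5_7B_img_qa_gc.py"
QUESTION_FILES = ["jsonl/gc/vanilla/gc_mat.jsonl", "jsonl/gc/vanilla/gc_trk.jsonl"]

for qfile in QUESTION_FILES:
    task = Path(qfile).stem                      # e.g. "gc_mat"
    out_dir = f"code/gc/test/test_res/test_{task.split('_')[-1]}"  # assumed naming
    subprocess.run(
        ["python", SCRIPT,
         "--question_file", qfile,
         "--image_folder", "data/gc/processed",
         "--output_dir", out_dir],
        check=True,
    )
```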
For more details, please refer to the .bash scripts for each task directly. You may easily navigate to these files following the Roadmap below.
Example model: Qwen2.5-VL-7B-Instruct
- GC
- inference script: code/gc/test/test_script_example/test_qwen2p5_7B_img_qa_gc.py
- bash script: code/gc/test/run_gc_full_round.bash
- OC
- inference script: code/oc/test/test_script_example/test_qwen2p5_7B_img_qa_oc.py
- bash script: code/oc/test/run_oc_full_round.bash
- PC-image
- inference script: code/pc/image/test/test_script_example/test_qwen2p5_7B_img_qa_pc.py
- bash script: code/pc/image/test/run_pc-i_full_round.bash
- PC-video (open-ended)
- inference script: code/pc/video/test/test_script_example/test_qwen2p5_7B_vid_qa_pc-v.py
- bash script: code/pc/video/test/run_pc-v_full_round.bash
We provide separate evaluation scripts for each task, as well as an all-in-one Jupyter notebook that evaluates all tasks.
- Navigate into the project directory, then run the evaluation script in vlm2bench_evaluator.ipynb. Remember to set the correct path to your result folder according to the instructions in the notebook.
- To evaluate the results of a single task, you can either run the corresponding cells in the notebook or run the script in that task's eval directory (for example, code/gc/eval/eval_tf_batch_pair_3acc.py).
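As a rough illustration of what scoring a True/False task involves, the sketch below computes plain accuracy from a results file. The field names and result path are hypothetical; the official scripts (e.g. code/gc/eval/eval_tf_batch_pair_3acc.py) define the actual metrics reported on the leaderboard, including the paired accuracy used for GC tasks.

```python
# Minimal sketch of scoring True/False predictions against ground truth.
# The field names ("prediction", "answer") and the result path are hypothetical;
# use the official evaluation scripts for leaderboard-comparable numbers.
import json

def tf_accuracy(result_file: str) -> float:
    correct, total = 0, 0
    with open(result_file, "r", encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            pred = str(item["prediction"]).strip().lower()   # hypothetical field
            gold = str(item["answer"]).strip().lower()       # hypothetical field
            # Count the response as correct if it starts with the gold label,
            # so "True." or "true, because ..." still count.
            correct += int(pred.startswith(gold))
            total += 1
    return correct / max(total, 1)

print(tf_accuracy("code/gc/test/test_res/test_mat/results.jsonl"))  # hypothetical path
```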
Evaluation with VLMEvalKit on all 2,860 image cases
Please refer to the Quick Start tutorial of this toolkit for detailed instructions on setting up the environment and running the inference.
A simple inference example on our dataset (with name VLM2Bench) can be executed using:
python run.py \
    --data VLM2Bench \
    --model Qwen2.5-VL-7B-Instruct \
    --work-dir /path/to/your/result/folder

The leaderboard is shown below:
Our evaluation on 8 state-of-the-art open-source vision-language models and GPT-4o shows:
- Significant Performance Gap: Even the best-performing model (GPT-4o) is on average ~34.80% behind human performance.
- Diverse Performance Patterns: Models exhibit distinct strengths and weaknesses across various visual cue categories, indicating the need for specialized improvements.
If you find this work useful, please cite our paper:
@misc{zhang2025vlm2benchcloserlookvlms,
title={VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues},
author={Jianshu Zhang and Dongyu Yao and Renjie Pi and Paul Pu Liang and Yi R. Fung},
year={2025},
eprint={2502.12084},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.12084},
}

Jianshu Zhang: [email protected]
Yi R. (May) Fung: [email protected]
Code: Licensed under the Apache 2.0 License. Dataset: Licensed under the CC BY-NC 4.0 License.


