
VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues

ACL 2025 Main

Jianshu Zhang¹*, Dongyu Yao²*, Renjie Pi¹, Paul Pu Liang³, Yi R. (May) Fung¹

¹ HKUST   ² CMU   ³ MIT

*: Equal contribution

Project Page | ArXiv | Hugging Face

[Leaderboard 📶]


Benchmark Introduction

VLM2-Bench is the first comprehensive benchmark that evaluates the ability of vision-language models (VLMs) to visually link matching cues across multi-image sequences and videos. The benchmark consists of 9 subtasks with over 3,000 test cases, designed to assess fundamental visual linking capabilities that humans use daily, such as identifying the same person across different photos without prior knowledge of their identity. Through extensive evaluation of eight open-source VLMs and GPT-4o using various prompting techniques, we uncover significant challenges in visual cue linking, with even the best model (GPT-4o) performing 34.80% below human level. Our analysis reveals the need for (1) stronger core visual capabilities with less reliance on prior knowledge, (2) better integration of language reasoning in visual tasks, and (3) improved training approaches for independent visual relationship inference.


News

  • 2025/05/15: 🔥 VLM2-Bench has been accepted to the ACL 2025 Main Conference!

  • 2025/03/12: 🔧 We have integrated all 2860 multi-image cases of VLM2-Bench into VLMEvalKit (example usage here). In the meantime, feel free to follow this repo for local deployment.

  • 2025/02/24: 🤗 We submitted our paper to Hugging Face Daily Papers (Feb 24); your upvotes are greatly appreciated! 👍

  • 2025/02/18: 🚀 The preprint of VLM2-Bench is now officially released!


VLM2-Bench Overview

VLM2-Bench is designed to evaluate models' ability to visually link matching cues across multiple images and videos. It is organized into three main categories:

  • General Cue (GC): Assessing matching and tracking of visual elements.
  • Object-centric Cue (OC): Evaluating comparison, counting, and grouping of objects.
  • Person-centric Cue (PC): Focusing on comparing, counting, and grouping individuals, as well as describing their identities in videos.

The dataset comprises a total of 3060 question-answer pairs generated via a semi-automated pipeline with human verification, covering various question formats such as True/False, multiple-choice, numerical, and open-ended queries.

[Figure: VLM2-Bench overview]

Dataset Statistics

[Figure: VLM2-Bench dataset statistics]

How to Evaluate Your Model on VLM2-Bench

Step 0: Environment Setup

  • Git clone VLM2-Bench:
git clone https://github.com/vlm2-bench/VLM2-Bench.git
cd VLM2-Bench
  • Create a conda environment with Python 3.9:
conda create -n vlm2bench python=3.9
conda activate vlm2bench
pip install "openai>=1"
pip install -r requirements.txt

For model inference, our benchmark does not require any specific packages. We recommend using the official inference scripts provided by model developers. For example, to test Qwen2.5-VL-7B-Instruct, you can follow the installation and inference instructions at Qwen2.5-VL-7B-Instruct.
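As a concrete reference, the sketch below shows minimal multi-image inference with Qwen2.5-VL-7B-Instruct via Hugging Face transformers and qwen-vl-utils, following that model card's usage; the image paths and the question are placeholders rather than actual VLM2-Bench cases, and the test scripts under code/<task>/test remain the reference way to run the benchmark.

# Minimal multi-image inference sketch for Qwen2.5-VL-7B-Instruct,
# following the model card's transformers + qwen-vl-utils usage.
# The image paths and the question below are placeholders, not real benchmark cases.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# A two-image True/False query, similar in spirit to the gc_mat task.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "data/gc/processed/example_1.jpg"},  # placeholder
        {"type": "image", "image": "data/gc/processed/example_2.jpg"},  # placeholder
        {"type": "text", "text": "Do these two images show the same scene? Answer True or False."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)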

Step 1: Download the Data

  • Download the VLM2-Bench dataset from our Hugging Face repository (link) and unzip it at the root directory of this repository:
unzip vlm2-bench_dataset.zip

After unzipping, you will see the following structure:

vlm2-bench/
├── code
│   ├── gc
│   ├── oc
│   └── pc
├── data (images and videos)
│   ├── gc
│   ├── oc
│   └── pc
└── jsonl (question files)
    ├── gc
    │   └── vanilla
    │       ├── gc_mat.jsonl
    │       └── gc_trk.jsonl
    ├── oc
    └── pc

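If you want to inspect a question file before running inference, a small sketch like the one below shows how many cases it contains and which fields each record carries; the JSONL schema is read from the file itself rather than assumed here.

import json
from pathlib import Path

# Peek at one question file: count the cases and list the fields of the first record.
# The schema is read directly from the file rather than hard-coded.
question_file = Path("jsonl/gc/vanilla/gc_mat.jsonl")

with question_file.open(encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{question_file} contains {len(records)} questions")
print("Fields of the first record:", sorted(records[0].keys()))
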
Step 2: Run Model Inference

Example usage for a single model on the gc_mat task:

python code/gc/test/test_script_example/test_qwen2p5_7B_img_qa_gc.py \
--question_file "jsonl/gc/vanilla/gc_mat.jsonl" \
--image_folder "data/gc/processed" \
--output_dir "code/gc/test/test_res/test_mat"
  • Additionally, under the test directory of each task, there is a complete bash script for sequential testing on multiple models, for example: code/gc/test/run_gc_full_round.bash.

Example commands:

bash code/gc/test/run_gc_full_round.bash

This script runs the model on the gc_mat and gc_trk tasks and saves the results in the code/gc/test/test_res directory.

For more details, please refer to the .bash scripts for each task directly. You may easily navigate to these files following the Roadmap below.

Roadmap of inference scripts and bash scripts for all tasks in VLM2-Bench (example model: Qwen2.5-VL-7B-Instruct)

Step 3: Evaluate the Results

We provide separate evaluation scripts for each task as well as an all-in-one evaluation script (Jupyter notebook) for evaluating all tasks.

  • Navigate into the project directory, then run the evaluation script in vlm2bench_evaluator.ipynb. Remember to set the correct path to your result folder according to the instructions in the notebook.
  • To evaluate the results of a single task, you can either run the corresponding cell in the notebook or run the evaluation script in the task's eval directory (for example, code/gc/eval/eval_tf_batch_pair_3acc.py).

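For a quick sanity check of a single result file outside the notebook, a rough sketch along the lines below computes a plain exact-match accuracy; the "pred" and "answer" field names are hypothetical, and the official per-task scripts under code/<task>/eval implement the benchmark's actual metrics.

import json
import sys

# Rough sanity check: exact-match accuracy over one result JSONL.
# NOTE: the "pred" / "answer" field names are hypothetical; adapt them to the
# actual output format, and rely on the official eval scripts for reported numbers.
result_file = sys.argv[1]  # e.g. a file under code/gc/test/test_res/test_mat/

correct = total = 0
with open(result_file, encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        record = json.loads(line)
        pred = str(record.get("pred", "")).strip().lower()
        gold = str(record.get("answer", "")).strip().lower()
        total += 1
        correct += int(pred == gold)

print(f"{correct}/{total} exact matches ({correct / max(total, 1):.2%})")
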
Evaluation with VLMEvalKit on all 2860 multi-image cases

Please refer to the Quick Start tutorial of this toolkit for detailed instructions on setting up the environment and running inference.

A simple inference run on our dataset (dataset name: VLM2Bench) can be executed with:

python run.py \
--data VLM2Bench \
--model Qwen2.5-VL-7B-Instruct \
--work-dir /path/to/your/result/folder

Experimental Results

The leaderboard is shown below:

[Leaderboard figure]

Our evaluation of eight state-of-the-art open-source vision-language models and GPT-4o shows:

  • Significant Performance Gap: Even the best-performing model (GPT-4o) is on average ~34.80% behind human performance.
  • Diverse Performance Patterns: Models exhibit distinct strengths and weaknesses across various visual cue categories, indicating the need for specialized improvements.

Citation

If you find this work useful, please cite our paper:

@misc{zhang2025vlm2benchcloserlookvlms,
      title={VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues}, 
      author={Jianshu Zhang and Dongyu Yao and Renjie Pi and Paul Pu Liang and Yi R. Fung},
      year={2025},
      eprint={2502.12084},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.12084}, 
}

Contact

Jianshu Zhang: [email protected]

Yi R. (May) Fung: [email protected]


License

Code: Licensed under the Apache 2.0 License. Dataset: Licensed under the CC BY-NC 4.0 License.
