
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

A Challenging Visual-centric Benchmark for Evaluating Multimodal Reasoning in MLLMs!

This is the evaluation code repository for VisuLogic.

For dataset exploration, related code repositories, and visualization tools, please refer to the project page: https://visulogic-benchmark.github.io/VisuLogic/.

VisuLogic Resources

๐ŸŒ Homepage | ๐Ÿ† Leaderboard | ๐Ÿ“– Paper | ๐Ÿค— Benchmark | ๐Ÿ’ป Eval Code | ๐Ÿค— Train Data | ๐Ÿ’ป Train Code

🔔 News

  • 🔥 [2025-06-28] Released the SFT data! 🚀
  • 🔥 [2025-04-26] VisuLogic has been merged into VLMEvalkit; you can now evaluate your model on VisuLogic with it. See VLMEvalkit for usage! 🚀
  • 🔥 [2025-04-22] Released the paper, training data, and training code! 🚀
  • 🔥 [2025-04-08] Released the benchmark and the evaluation code! 🚀

✅ To-do

  • Release the benchmark dataset and eval code
  • Release training code
  • Release the paper
  • Release the training dataset
  • Release model ckpts

📖 Introduction

VisuLogic is a newly designed benchmark for evaluating the visual reasoning capabilities of Multi-modal Large Language Models (MLLMs), independent of textual reasoning processes. It features carefully constructed visual reasoning tasks spanning multiple categories, divided into six types based on the required reasoning skills (e.g., Quantitative Reasoning, which involves understanding and deducing changes in the quantity of elements in images). Unlike existing benchmarks, VisuLogic poses tasks that are inherently difficult to articulate in language, providing a more rigorous evaluation of the visual reasoning capabilities of MLLMs. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans, revealing significant gaps in visual reasoning.

🌟 Key Features

  • 🚀 Visuo-Logical Challenge
    The first benchmark to integrate visual perception with logical reasoning, enabling authentic multimodal evaluation.

  • ๐Ÿ› ๏ธ Rigorous Design
    Includes 1,000 meticulously curated questions, spanning 6 domains and 23 subcategories, for comprehensive performance evaluation.

  • ๐Ÿ“ Anti-Linguistic Shortcut
    Designed to avoid linguistic reasoning, ensuring tasks rely on genuine visual reasoning rather than shortcuts.

  • 💡 RL Exploration
    We identify reinforcement learning (RL) as a promising direction for improving the visual reasoning capabilities of MLLMs. With RL training, models reach state-of-the-art performance on VisuLogic!

  • ✅ Fully Open-source
    We open-source all the evaluation code, training scripts, and datasets associated with this work to promote further research and innovation.

๐Ÿ–ผ๏ธ Examples of VisuLogic

Examples of VisuLogic

Installation & Preparation

๐Ÿ› ๏ธ Default Installation

For InternVL series, QwenVL series, glm-4v, ovis2, mplug-om3, llava-onevision

pip install -r requirements.txt

๐Ÿ› ๏ธ For Specific Models

minicpm-o Installation

pip install -r requirements.txt
pip install transformers==4.44.2

llava Installation

pip install -r requirements.txt
pip install transformers==4.37

sharegpt4v Installation

For more details, please refer to this link.

pip install -r requirements.txt
pip install transformers==4.37

📂 Prepare Benchmark Data

  1. Download the dataset from Hugging Face: https://huggingface.co/datasets/VisuLogic/VisuLogic
  2. Unzip images.zip so the directory looks like this:
|- ...
|- data.jsonl
|- images/ (unzipped from images.zip)
   |- 00000.png
   |- 00001.png
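Once unzipped, each line of data.jsonl is a standalone JSON record. A minimal sketch of loading it with the standard library; the field names here (`text`, `image_path`, `answer`) are illustrative assumptions, not the dataset's documented schema:

```python
import json
import os
import tempfile

# Hypothetical sample record for illustration; real field names may differ.
sample = {"text": "Which option continues the pattern?",
          "image_path": "images/00000.png",
          "answer": "a"}

# Write a tiny one-line data.jsonl so this sketch runs standalone.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")

# JSON Lines: parse one JSON object per line.
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(len(records), records[0]["image_path"])
```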

🚀 Evaluate Default Models

Find the script corresponding to your model and execute it, e.g.:

sh scripts/eval_internvl.sh

🔧 Evaluate Your Own Model

VisuLogic provides a clean and extensible framework for evaluating custom models. You only need to add or change two files.

Steps to add your model:

  1. Add model/mymodel.py, following this template:
from typing import Any

from models.base_model import BaseModel

class mymodel(BaseModel):
    def __init__(self, model_path: str, user_prompt: str = None):
        pass

    def predict(self, input_data: Any) -> Any:
        """
        Model prediction interface.
        Args:
            input_data:
                input_data['text']        # question text
                input_data['image_path']  # path to the question image
        """
        pass

    @property
    def name(self) -> str:
        """Model name"""
        pass
  2. Modify model/__init__.py:
...
from models.mymodel import mymodel
def load_model(args):
  ...
  elif 'mymodel' in args.model_path.lower():
    model = mymodel(model_path = args.model_path,
                    user_prompt = args.user_prompt)
  ...
  return model
  3. Run the evaluation script:
mkdir -p outputs/
python evaluation/eval_model.py \
    --input_file path/to/data.jsonl \
    --output_file outputs/output_file.jsonl \
    --model_path mymodel \
    --judge_api_key sk-xxx
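As a concrete illustration of the template above, here is a minimal stand-in implementation. EchoModel and its fixed \boxed{a} response are purely hypothetical placeholders for a real inference call, and the BaseModel parent is omitted so the sketch runs standalone:

```python
from typing import Any

class EchoModel:  # in the repo this would subclass models.base_model.BaseModel
    def __init__(self, model_path: str, user_prompt: str = None):
        self.model_path = model_path
        self.user_prompt = user_prompt or ""

    def predict(self, input_data: Any) -> Any:
        # A real model would run inference on input_data['image_path']
        # and input_data['text']; here we just echo a fixed choice in
        # the boxed format the response filter expects.
        return "\\boxed{a}"

    @property
    def name(self) -> str:
        return "echo-model"
```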

๐Ÿ› ๏ธ Pipeline of Evaluation

pipeline of response filter VisuLogic evaluates model accuracy by combining boxed, predefined, and LLM-based extraction methods to produce a single choice (a/b/c/d), then compares it with the ground-truth label to determine correctness.
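A minimal sketch of the first two extraction stages (boxed-answer and plain-letter matching); the repo's actual regexes and its LLM-based fallback are not reproduced here:

```python
import re
from typing import Optional

def extract_choice(response: str) -> Optional[str]:
    """Extract a single choice (a/b/c/d) from a model response.
    Stage 1: look for \\boxed{<letter>}.
    Stage 2: fall back to the first standalone a-d letter.
    (The real pipeline adds an LLM-based extractor for free-form answers.)
    """
    m = re.search(r"\\boxed\{\s*([a-dA-D])\s*\}", response)
    if m:
        return m.group(1).lower()
    m = re.search(r"\b([a-dA-D])\b", response)
    return m.group(1).lower() if m else None

print(extract_choice("The answer is \\boxed{B}."))  # prints: b
```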

📦 Training

Please refer to VisuLogic-Train for training code.

📩 Contact

📜 Citation

BibTeX:

@article{xu2025visulogic,
  title={VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models},
  author={Xu, Weiye and Wang, Jiahao and Wang, Weiyun and Chen, Zhe and Zhou, Wengang and Yang, Aijun and Lu, Lewei and Li, Houqiang and Wang, Xiaohua and Zhu, Xizhou and Wang, Wenhai and Dai, Jifeng and Zhu, Jinguo},
  journal={arXiv preprint arXiv:2504.15279},
  year={2025},
  url={https://arxiv.org/abs/2504.15279}
}

🎉 Thank you for your interest in VisuLogic! We hope this benchmark helps drive advancements in multimodal reasoning! 🚀
