[Benchmark] Support VisuLogic #944
Merged
kennymckormick merged 3 commits into open-compass:main on Apr 25, 2025
Conversation
Member
A challenging benchmark; tested GPT-4o-mini on it and got 25.2% accuracy, close to the random baseline.
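For context, a quick back-of-the-envelope significance check (assuming the reported 25.2% was measured on the full 1,000-question set described below) shows this score is statistically indistinguishable from random guessing on a four-option multiple-choice test:

```python
from math import sqrt

n, p0, acc = 1000, 0.25, 0.252  # question count, random baseline, observed accuracy
se = sqrt(p0 * (1 - p0) / n)    # standard error of a random guesser's accuracy over n questions
z = (acc - p0) / se
print(f"standard error = {se:.4f}, z = {z:.2f}")  # se ≈ 0.0137, z ≈ 0.15, well below 1.96
```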
kennymckormick added a commit to ryf1123/VLMEvalKit that referenced this pull request on Apr 30, 2025 (Co-authored-by: Haodong Duan <[email protected]>).
kennymckormick added a commit that referenced this pull request on Apr 30, 2025.
Koii2k3 pushed commits to wjnwjn59/VLMEvalKit that referenced this pull request on Nov 13, 2025.
VisuLogic Resources
🌐 Homepage | 🏆 Leaderboard | 📖 Paper | 🤗 Benchmark | 💻 Eval Code | 🤗 Train Data | 💻 Train Code
📖 Introduction
VisuLogic is a newly designed benchmark for evaluating the visual reasoning capabilities of Multi-modal Large Language Models (MLLMs) independently of textual reasoning. It features carefully constructed visual reasoning tasks divided into six types according to the reasoning skills they require (e.g., Quantitative Reasoning, which involves understanding and deducing changes in the quantity of elements in images). Unlike existing benchmarks, VisuLogic poses tasks that are inherently difficult to articulate in language, providing a more rigorous evaluation of the visual reasoning capabilities of MLLMs. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans, revealing significant gaps in visual reasoning.
🌟 Key Features
🚀 Visuo-Logical Challenge
The first benchmark to integrate visual perception with logical reasoning, enabling authentic multimodal evaluation.
🛠️ Rigorous Design
Includes 1,000 meticulously curated questions, spanning 6 domains and 23 subcategories, for comprehensive performance evaluation.
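With results labeled by domain, per-category accuracy is easy to compute. A minimal sketch, assuming a per-question results file with 'category' and 'hit' columns (1 if the prediction matched the ground truth), which is the schema VLMEvalKit's multiple-choice evaluators typically emit; the file name here is hypothetical:

```python
import pandas as pd

# Hypothetical per-question results file; VLMEvalKit MCQ evaluators typically
# write one row per question with a 'category' label and a binary 'hit' column.
df = pd.read_csv("VisuLogic_gpt-4o-mini_result.tsv", sep="\t")

per_domain = df.groupby("category")["hit"].mean().sort_values()  # accuracy per domain
print(per_domain.to_string())
print(f"Overall accuracy: {df['hit'].mean():.1%}")
```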
📝 Anti-Linguistic Shortcut
Designed to avoid linguistic reasoning, ensuring tasks rely on genuine visual reasoning rather than shortcuts.
💡 RL Exploration
We identify reinforcement learning (RL) as a promising direction for improving the visual reasoning capabilities of MLLMs. Trained with RL, models reach state-of-the-art (SOTA) performance on VisuLogic!
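As a concrete illustration of the kind of signal such RL training needs, here is a minimal rule-based reward for multiple-choice answers. This is a generic exact-match reward sketch, not the authors' published reward design:

```python
import re

def mcq_reward(response: str, answer: str) -> float:
    """Return 1.0 if the first standalone option letter (A-D) in the model's
    response matches the ground-truth letter, else 0.0. A generic exact-match
    reward sketch, not VisuLogic's actual training recipe."""
    match = re.search(r"\b([A-D])\b", response)
    return 1.0 if match and match.group(1) == answer.strip().upper() else 0.0

# The scalar reward would feed a PPO/GRPO-style policy update.
print(mcq_reward("The grid rotates 90 degrees each step, so the answer is B.", "B"))  # 1.0
```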
✅ Fully Open-source
We open-source all the evaluation code, training scripts, and datasets associated with this work to promote further research and innovation.