[Benchmark] Support TDBench #947
Conversation
@kennymckormick:

Hi @zhaomh1998, sorry that some new conflicts emerged after we merged some new PRs. Would you please grant me permission so I can resolve the conflicts on my own? By the way, I'm not sure how we can get the RotationEval results, since the four rotations are split into 4 separate TSV files.

@zhaomh1998:
Hi @kennymckormick, I have resolved the conflicts and added you to the repository in case there are more conflicts. RotationEval is executed in the `evaluate` function based on the results from `mcq_vanilla_eval` (MCQ) / our own metrics (centroid for VQA). It looks for all available rotation results (`_rot*`) and combines them automatically. We added usage instructions here. Thanks!
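For the grounding (VQA) split, the centroid metric presumably checks whether the center of the predicted bounding box falls inside the ground-truth box. A minimal sketch of that idea, assuming `(x1, y1, x2, y2)` boxes; this is an illustration, not the PR's actual code:

```python
# Illustrative centroid-style grounding check; the (x1, y1, x2, y2) box
# format is an assumption, not taken from the PR's implementation.
def centroid_hit(pred_box, gt_box) -> bool:
    cx = (pred_box[0] + pred_box[2]) / 2  # center of the predicted box
    cy = (pred_box[1] + pred_box[3]) / 2
    # Correct if the predicted center lies inside the ground-truth box.
    return gt_box[0] <= cx <= gt_box[2] and gt_box[1] <= cy <= gt_box[3]
```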
Commits:

* [Benchmark] Add TDBench for top-down images
* fix REresult symlink and index
* fix symlink
* fix lint



Add support for TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images.
Repo: https://github.com/Columbia-ICSL/TDBench
arXiv: https://arxiv.org/abs/2504.03748
Dataset: https://huggingface.co/datasets/Columbia-ICSL/TDBench
This benchmark includes 10 dimensions (9 MCQ dimensions + 1 VQA dimension), each containing 200 human-designed questions evaluated across four rotations, along with 4 case studies (MCQ) that provide actionable insights for practical drone + VLM deployments.
The main benchmark implements RotationalEval, where we rotate images and their corresponding answers. A model must correctly answer all rotated versions of the same image to receive credit, preventing random guessing. RotationalEval activates automatically when results from multiple rotations (_rot*) are available. The results are saved in *_REresult.csv.
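As a minimal sketch of that scoring rule (pandas-based; the function and column names here are illustrative, not VLMEvalKit's actual API):

```python
# Hypothetical sketch of the all-rotations-correct rule described above.
import pandas as pd

def rotational_eval(rotation_results: list[pd.DataFrame]) -> float:
    """Each DataFrame holds per-question results for one rotation, with a
    shared 'index' column and a boolean 'hit' column (True if correct)."""
    merged = rotation_results[0][['index', 'hit']].rename(columns={'hit': 'hit_0'})
    for i, df in enumerate(rotation_results[1:], start=1):
        merged = merged.merge(
            df[['index', 'hit']].rename(columns={'hit': f'hit_{i}'}), on='index')
    hit_cols = [c for c in merged.columns if c.startswith('hit_')]
    # Credit a question only if every rotated version was answered correctly.
    return merged[hit_cols].all(axis=1).mean()
```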
Names (for run.py):

- Main dataset - 9 dimensions (MCQ)
- Main dataset - grounding (VQA). Note: `--judge` must be `centroid` or `iou`
- Case Studies (MCQ)
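For example (the dataset and model names here are placeholders): `python run.py --data <tdbench_dataset_name> --model <model_name> --judge centroid`; pass `--judge iou` to select IoU-based matching for the grounding split instead.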