Xuanwen Ding1,3* , Chengjun Pan1* , Zejun Li1* , Jiwen Zhang1* ,
Siyuan Wang2 , Zhongyu Wei1,3†.
1Fudan University, Shanghai, China
2University of Southern California, Los Angeles, USA
3Shanghai Innovation Institute, Shanghai, China
* Equal Contribution, † Corresponding Author
AutoJudger is an agent-driven framework for efficient and adaptive benchmarking of MLLMs that tackles the escalating cost of full-benchmark evaluation. AutoJudger employs Item Response Theory (IRT) to estimate question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions based on the model's real-time performance. Specifically, AutoJudger incorporates two pivotal components: a semantic-aware retrieval mechanism that ensures the selected questions cover diverse and challenging scenarios across both vision and language modalities, and a dynamic memory that maintains contextual statistics of previously evaluated questions to guide coherent and globally informed question selection throughout the evaluation process. Extensive experiments on four representative multimodal benchmarks demonstrate that our adaptive framework dramatically reduces evaluation expenses; for example, on MMT-Bench AutoJudger uses only 4% of the data to achieve over 90% ranking accuracy relative to the full-benchmark evaluation.
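For intuition, here is a minimal conceptual sketch of such an adaptive evaluation loop. It is not the implementation in main.py: the helper names (`estimate_ability`, `adaptive_eval`) are hypothetical, and the nearest-difficulty selection rule stands in for AutoJudger's agent-driven retrieval and dynamic memory.

```python
# Conceptual sketch only: the helper names below are hypothetical and the
# nearest-difficulty selection rule is a stand-in for AutoJudger's
# agent-driven retrieval and memory.
import math

def estimate_ability(memory):
    """Crude IRT (Rasch) ability estimate from (difficulty, correct) pairs."""
    theta = 0.0
    for _ in range(100):  # simple gradient ascent on the log-likelihood
        grad = sum(c - 1.0 / (1.0 + math.exp(-(theta - d))) for d, c in memory)
        theta += 0.1 * grad
    return theta

def adaptive_eval(pool, answer_fn, seed_questions, budget=100):
    """pool: {qid: difficulty}; answer_fn(qid) -> 1 if correct else 0."""
    memory = [(pool[q], answer_fn(q)) for q in seed_questions]
    asked = set(seed_questions)
    while len(memory) < budget and len(asked) < len(pool):
        theta = estimate_ability(memory)
        # Ask the unseen question whose difficulty is closest to the current
        # ability estimate (most informative under a Rasch model).
        qid = min((q for q in pool if q not in asked),
                  key=lambda q: abs(pool[q] - theta))
        memory.append((pool[qid], answer_fn(qid)))
        asked.add(qid)
    return estimate_ability(memory)  # final ability estimate of the test model
```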
AutoJudger/
├── VLMEvalKit/ # Model evaluation tool
├── LMUData/ # Raw benchmark files (e.g., TSV)
├── models/ # Judging agent model weights (e.g., Qwen2.5-VL-7B-Instruct)
├── model_performance/ # Model evaluation records
│ └── SEEDBench_IMG/ # Example: SEEDBench-specific model responses
├── data/ # Processed benchmark info (splits, difficulty scores)
│ └── SEEDBench_IMG/ # Example: SEEDBench difficulty scores
├── clip_features/ # CLIP embeddings for all questions
│ └── clip_models/ # Downloaded CLIP model weights
├── init/ # Initial 10 seed questions (CLIP-based clustering)
└── out_folder/ # Output results
Clone our repository via the following command:
git clone [email protected]:IMNearth/AutoJudger.git
cd AutoJudger
conda create -n autojudger python=3.11
pip install -r requirements.txt
Install VLMEvalKit by following the instructions provided in its QuickStart guide:
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
Download the original benchmark files (i.e. the raw TSV files with questions, images, options, and answers) and place them in the LMUData/ directory:
- SEEDBench_IMG.tsv → LMUData/SEEDBench_IMG.tsv
- AI2D_TEST.tsv → LMUData/AI2D_TEST.tsv
- MMMU_DEV_VAL.tsv → LMUData/MMMU_DEV_VAL.tsv
- MMT-Bench_VAL.tsv → LMUData/MMT-Bench_VAL.tsv
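As a quick sanity check that the files are in place, you can inspect one TSV with pandas. The column names shown below (e.g., question, answer, a base64-encoded image field) reflect the typical VLMEvalKit layout and may differ slightly across benchmarks.

```python
# Sanity-check a downloaded benchmark TSV. The exact columns vary per
# benchmark; the names below reflect the typical VLMEvalKit layout.
import pandas as pd

df = pd.read_csv("LMUData/SEEDBench_IMG.tsv", sep="\t")
print(df.shape)             # (number of questions, number of columns)
print(df.columns.tolist())  # e.g. index, question, A, B, C, D, answer, image, ...
print(df.iloc[0]["question"])
```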
Collect the model response records on these benchmarks and put them in the model_performance/ directory.
- We thank VLMEvalKit for providing such a convenient tool to collect these records.
- Here is an example of how to collect model responses on SEEDBench_IMG:
cd VLMEvalKit
torchrun --nproc-per-node=4 run.py --data SEEDBench_IMG --model GPT4o Qwen2.5-VL-7B-Instruct --verbose
- After collecting the model responses, we have to transform these raw responses into a table. Below is an example of the transformed response records on the SEEDBench_IMG benchmark:
| model_sha | 360VL-70B | Aquila-VL-2B | ... | xgen-mm-phi3-interleave-r-v1.5 |
|---|---|---|---|---|
| 39 | 1 | 1 | ... | 1 |
| ... | ... | ... | ... | ... |
| 117 | 0 | 1 | ... | 0 |
| ... | ... | ... | ... | ... |
| 106418 | 1 | 1 | ... | 1 |
Here, the first column lists question IDs, and the subsequent columns correspond to different model names. The values 0 and 1 indicate whether a model answered the question incorrectly or correctly, respectively, so each row records whether each model's response to that question was correct.
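How you build this table depends on how the raw VLMEvalKit outputs are stored. The sketch below assumes one per-model result file with an index column (question ID) and a binary hit column (1 = correct); adjust the paths and column names to match your own records.

```python
# Build a question-by-model 0/1 correctness table from per-model records.
# Assumption: each file model_performance/SEEDBench_IMG/<model>.csv has an
# 'index' column (question ID) and a binary 'hit' column (1 = correct);
# adapt the glob pattern and column names to your own export format.
import glob
import os
import pandas as pd

records = {}
for path in glob.glob("model_performance/SEEDBench_IMG/*.csv"):
    model_name = os.path.splitext(os.path.basename(path))[0]
    df = pd.read_csv(path)
    records[model_name] = df.set_index("index")["hit"].astype(int)

# Rows: question IDs; columns: model names; values: 1 (correct) / 0 (incorrect).
table = pd.DataFrame(records)
print(table.head())
table.to_csv("model_performance/SEEDBench_IMG/response_table.csv")
```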
We have already provided the estimated question difficulty scores for SEEDBench_IMG, AI2D_TEST, MMMU_DEV_VAL and MMT-Bench_VAL in data/question_diff.zip.
Note that you can generate IRT-estimated difficulty scores for your own dataset by running
python prepare_dataset.py --benchmark YOUR_DATASET
The script will create a data/YOUR_DATASET/train directory and store the following files:
└── YOUR_DATASET
    └── train
        ├── train_model_df.json    # estimated model abilities
        ├── train_model_list.json  # list of models used to estimate the difficulties
        └── train_prob_df.json     # estimated question difficulty scores
In addition, the data/ folder also contains:
- Model list and benchmark metadata (data/model_information_new.csv)
- Train/test splits (data/test_model_list.json)
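For reference, the difficulty scores are obtained from an IRT-style fit over the 0/1 response table described above. The snippet below is a minimal Rasch-model (1-parameter IRT) sketch fitted by gradient ascent; it is illustrative only and not the exact estimator used in prepare_dataset.py.

```python
# Minimal Rasch-model (1-parameter IRT) fit by gradient ascent.
# Illustrative sketch, not the exact procedure in prepare_dataset.py.
import numpy as np

def fit_rasch(responses, lr=0.05, steps=2000):
    """responses: (num_questions, num_models) array of 0/1 (NaN for missing)."""
    R = np.asarray(responses, dtype=float)
    mask = ~np.isnan(R)
    n_q, n_m = R.shape
    difficulty = np.zeros(n_q)   # b_i: question difficulty
    ability = np.zeros(n_m)      # theta_j: model ability
    for _ in range(steps):
        logits = ability[None, :] - difficulty[:, None]   # theta_j - b_i
        p = 1.0 / (1.0 + np.exp(-logits))                 # P(correct)
        err = np.where(mask, R - p, 0.0)                  # log-likelihood gradient term
        ability += lr * err.sum(axis=0) / np.maximum(mask.sum(axis=0), 1)
        difficulty -= lr * err.sum(axis=1) / np.maximum(mask.sum(axis=1), 1)
    return difficulty, ability

# Toy example: 3 questions x 2 models
diff, abil = fit_rasch([[1, 1], [0, 1], [0, 0]])
print(diff, abil)
```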
Generate CLIP embeddings for the evaluation benchmark using the following commands:
cd ./clip_features
python clip.py --benchmark AI2D_TEST --mode multimodal --fusion_method concat --save_dir ./ --download_root ./clip_models
This script generates embeddings (text, image, or multimodal) for the questions using the CLIP model. It supports flexible processing modes and fusion methods for combining text and image embeddings. Here are the explanations for the arguments:
- --benchmark (str): The benchmark name; it must be one of the benchmarks supported in VLMEvalKit.
- --mode (str, default 'multimodal', choices ['text', 'image', 'multimodal']): When mode is set to "multimodal", both text and image embeddings are used to construct the question features.
- --fusion_method (str, default 'concat', choices ['mean', 'sum', 'concat']): When fusion_method is set to "concat", the concatenated text and image embeddings are used as the question feature.
- --download_root (str): Where to save the CLIP models; they are downloaded automatically by calling clip.load(...).
- --save_dir (str): Where to save the processed question embeddings.
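Conceptually, the fusion step amounts to the sketch below, written against OpenAI's clip package. The ViT-B/32 checkpoint and the per-question image path are assumptions for illustration; the actual clip.py reads the images from the benchmark TSVs.

```python
# Conceptual sketch of multimodal question features with CLIP.
# Assumptions: OpenAI's `clip` package, a ViT-B/32 checkpoint, and a local
# image path per question (the real clip.py decodes images from the TSVs).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, download_root="./clip_models")

def question_feature(text, image_path, fusion_method="concat"):
    with torch.no_grad():
        txt = model.encode_text(clip.tokenize([text], truncate=True).to(device))
        img = model.encode_image(preprocess(Image.open(image_path)).unsqueeze(0).to(device))
    if fusion_method == "mean":
        return (txt + img) / 2
    if fusion_method == "sum":
        return txt + img
    return torch.cat([txt, img], dim=-1)   # "concat"

feat = question_feature("What is shown in the image?", "example.jpg")
print(feat.shape)   # e.g. torch.Size([1, 1024]) for concat with ViT-B/32
```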
Download Qwen2.5-VL-7B-Instruct via the following commands:
mkdir models
cd ./models
git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
To launch an adaptive evaluation on a specific benchmark, run the script main.py. This script accepts several command-line arguments to control model evaluation and benchmarking behavior:
- --agent_model (str, default='Qwen2.5-VL-7B-Instruct'): Specifies the name of the judging model (or "agent model") used for automatic evaluation.
- --test_model (list of str, optional): One or more names of the models to be evaluated. You can pass multiple model names separated by spaces (e.g., --test_model modelA modelB).
- --benchmark (str, default='SEEDBench_IMG'): Sets the benchmark dataset to use for evaluation.
- --feature (str, default='text'): Indicates the type of feature representation to use. Options are 'text' (use only text-based features), 'image' (use only image-based features), 'multimean' (use the mean of text and image features), and 'multiconcat' (use concatenated text and image features).
- --root_path (str, default='./'): Specifies the root directory of the project.
- --include_text (flag): If provided, includes textual content in the evaluation process.
- --include_image (flag): If provided, includes visual content in the evaluation process.
Here is an example that uses Qwen2.5-VL-7B-Instruct as the judging agent to evaluate MiniCPM-V-2 and Slime-7B on SEEDBench_IMG:
python main.py --agent_model Qwen2.5-VL-7B-Instruct --test_model MiniCPM-V-2 Slime-7B --benchmark SEEDBench_IMG --include_text --include_image
If AutoJudger has been beneficial to your research and work, please cite our work using the following format:
@misc{ding2025autojudgeragentdrivenframeworkefficient,
title={AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs},
author={Xuanwen Ding and Chengjun Pan and Zejun Li and Jiwen Zhang and Siyuan Wang and Zhongyu Wei},
year={2025},
eprint={2505.21389},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.21389},
}
