MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
Dongping Chen * 1 Ruoxi Chen * 2 Shilin Zhang * 1 Yaochen Wang * 1 Yinuo Liu * 1 Huichi Zhou * 1
Qihui Zhang * 1 Yao Wan 1 Pan Zhou 1 Lichao Sun 3
1. Introduction
Figure 1. Comparative performance of different MLLMs across three judging settings on 10 datasets; each result is the average of three iterations. As CogVLM is unable to perform the batch ranking task, only the other six MLLMs are shown.
GPT-4V (OpenAI, 2023), Gemini (GeminiTeam, 2023)[1], LLaVA-1.5-13b, LLaVA-1.6-34b (Liu et al., 2023d), CogVLM (Wang et al., 2023c), and Qwen-VL-Max (Bai et al., 2023a), to generate responses to each instruction across three distinct evaluation settings. The produced responses are subsequently gathered and undergo additional annotation by human evaluators, who apply stringent criteria to ensure an impartial and thorough assessment of the judgments made by the MLLMs.

Furthermore, we assess the ability of MLLMs as judges in multimodal tasks by calculating the similarity between human and MLLM judgments and measuring human agreement on the analyses and judgments made by those MLLMs (an illustrative metric sketch is given at the end of this section). In particular, we target eleven widely-used MLLMs, i.e., GPT-4V and Gemini-Pro-1.0/1.5, CogVLM, the LLaVA-1.5/1.6 family, and the Qwen-VL family, across two settings (with or without vision input), over three distinct tasks (i.e., Scoring Evaluation, Pair Comparison, and Batch Ranking). Figure 1 compares the performance of various MLLMs across different datasets and settings, illustrating that GPT-4V exhibits significantly superior capabilities as a judge compared to other MLLMs.

As a benchmark, we also release two curated datasets to facilitate further studies: MLLM-AS-A-JUDGE-HQ, which showcases responses with a high level of concordance with human judgments, and MLLM-AS-A-JUDGE-HARD, which includes responses marked by inconsistency with human preferences and instances of hallucination. Additionally, we address the limitations of MLLMs in judgment, such as egocentric bias, position bias, length bias, and hallucination (a minimal position-bias probe is also sketched below). We demonstrate that integrating CoT (Wei et al., 2022) and a vision expert system can effectively mitigate some of these biases.

Take-Aways. We evaluate the judgment performance of 11 MLLMs across 14 datasets under three settings: score evaluation, pair comparison, and batch ranking. Our findings reveal several key insights. First, while MLLMs demonstrate proficiency in aligning with human preferences in pair comparison tasks, they require further improvement in score evaluation and batch ranking, particularly in reasoning tasks. Second, GPT-4V consistently outperforms other models across all tasks and settings.

Finally, the presence of hallucinations, biases, and inconsistent judgments in MLLMs highlights significant challenges that must be addressed for these models to become a viable alternative to traditional human evaluations.

To summarize, our work provides three key contributions:

• A Benchmark. We are the first to develop a comprehensive benchmark, MLLM-as-a-Judge, in multimodal domains, with human annotations to assess the judging capability of MLLMs in the tasks of Scoring Evaluation, Pair Comparison, and Batch Ranking.

• Two Datasets. We curate two human preference datasets: MLLM-AS-A-JUDGE-HQ, which contains high-quality questions, and MLLM-AS-A-JUDGE-HARD, which includes instances of hallucination. These datasets can serve as rigorous testing grounds to facilitate the development of MLLMs in aligning with human preferences.

• Findings and Implications. Our evaluation of mainstream MLLMs reveals that while MLLMs exhibit alignment with human judgments in Pair Comparison, notable discrepancies can be found in Scoring Evaluation and Batch Ranking. Furthermore, our findings reveal that MLLMs exhibit a range of biases and hallucinations, along with inconsistent judgments during the evaluation process, representing significant hurdles in establishing MLLMs as reliable judges.

[1] For conciseness, we refer to GPT-4V(ision) as GPT-4V, and Gemini-Pro-Vision as Gemini throughout this paper.
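To make the similarity computation referenced above concrete, the sketch below scores an MLLM judge against human labels separately for each setting. The metric choices (Pearson correlation for Scoring Evaluation, exact agreement for Pair Comparison, and normalized Levenshtein similarity for Batch Ranking) and all function names are illustrative assumptions for this sketch, not necessarily the exact protocol defined later in the paper.

```python
# Hypothetical per-setting similarity between MLLM and human judgments.
from scipy.stats import pearsonr


def score_similarity(model_scores: list[float], human_scores: list[float]) -> float:
    """Scoring Evaluation: Pearson correlation between judge and human scores."""
    return pearsonr(model_scores, human_scores)[0]


def pair_agreement(model_choices: list[str], human_choices: list[str]) -> float:
    """Pair Comparison: fraction of items where the judge picks the same
    verdict ('A', 'B', or 'Tie') as the human annotators."""
    matches = sum(m == h for m, h in zip(model_choices, human_choices))
    return matches / len(human_choices)


def ranking_similarity(model_rank: str, human_rank: str) -> float:
    """Batch Ranking: 1 minus normalized Levenshtein distance between two
    ranking strings such as 'CBAD' and 'BCAD'."""
    n, m = len(model_rank), len(human_rank)
    if max(n, m) == 0:
        return 1.0
    # Standard dynamic-programming edit distance.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if model_rank[i - 1] == human_rank[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return 1 - dp[n][m] / max(n, m)
```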
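Similarly, the position bias mentioned among the limitations can be probed by querying the judge twice with the two responses swapped and checking whether the preferred response stays the same. In the minimal sketch below, judge_pair is a hypothetical stand-in for whatever MLLM judging call is used; it is assumed to return 'A', 'B', or 'Tie'.

```python
# Minimal sketch of a position-bias probe for Pair Comparison.
# `judge_pair(instruction, image, resp_a, resp_b)` is a hypothetical stand-in
# for an MLLM judging call that returns 'A', 'B', or 'Tie'.

def flip(choice: str) -> str:
    """Map a verdict given on the swapped order back to the original order."""
    return {"A": "B", "B": "A", "Tie": "Tie"}[choice]


def position_consistent(judge_pair, instruction, image, resp_a, resp_b) -> bool:
    """True if the judge prefers the same response regardless of the order
    in which the two responses are presented."""
    forward = judge_pair(instruction, image, resp_a, resp_b)
    backward = flip(judge_pair(instruction, image, resp_b, resp_a))
    return forward == backward


def position_bias_rate(judge_pair, samples) -> float:
    """Fraction of samples whose verdict changes when the order is swapped."""
    inconsistent = sum(
        not position_consistent(
            judge_pair, s["instruction"], s["image"], s["resp_a"], s["resp_b"]
        )
        for s in samples
    )
    return inconsistent / len(samples)
```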
[Figure: pipeline overview showing sampled instructions and images, MLLM responses, and the three judging settings: scoring evaluation (e.g., "Judgement: 4"), pair comparison (e.g., "Judgement: B"), and batch ranking (e.g., "Judgement: CBAD").]
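The example outputs in the figure above ("Judgement: 4", "Judgement: B", "Judgement: CBAD") suggest one judgment format per setting: an integer score, a pair verdict, and a ranking string. The parsers below are a minimal sketch under the assumption that judgments follow exactly this "Judgement: ..." pattern; the benchmark's actual output templates may differ.

```python
import re

# Hypothetical parsers for the three judgment formats illustrated in the figure.

def parse_score(judge_output: str) -> int | None:
    """Extract an integer score from text like 'Judgement: 4'."""
    m = re.search(r"Judgement:\s*(\d+)", judge_output)
    return int(m.group(1)) if m else None


def parse_pair(judge_output: str) -> str | None:
    """Extract a pair verdict ('A', 'B', or 'Tie') from text like 'Judgement: B'."""
    m = re.search(r"Judgement:\s*(A|B|Tie)\b", judge_output)
    return m.group(1) if m else None


def parse_ranking(judge_output: str) -> str | None:
    """Extract a ranking string such as 'CBAD' from text like 'Judgement: CBAD'."""
    m = re.search(r"Judgement:\s*([A-D]{2,})", judge_output)
    return m.group(1) if m else None
```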