EndoBench

A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis

Shengyuan Liu*1, Boyun Zheng*1, Wenting Chen*2, Zhihao Peng1,
Zhenfei Yin3, Jing Shao4, Jiancong Hu5, Yixuan Yuan†1

1The Chinese University of Hong Kong, 2City University of Hong Kong, 3University of Oxford, 4Shanghai AI Laboratory, 5The Sixth Affiliated Hospital, Sun Yat-sen University

*Core Contributors
†Corresponding to: [email protected]

News

• 🎉 We are excited to announce that EndoBench has been accepted to the NeurIPS 2025 Datasets and Benchmarks Track.

Highlight

1. We introduce EndoBench, the first comprehensive benchmark specifically designed to evaluate MLLMs across the complete spectrum of endoscopy, covering 4 endoscopic scenarios, 12 specialized tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities.

2. We develop a multi-dimensional evaluation framework that mirrors the clinical workflow progression from basic anatomical recognition to advanced surgical intervention, assessing MLLMs' capabilities across the full spectrum of endoscopic analysis skills.

3. We conduct an extensive comparative evaluation of 23 MLLMs (13 open-source general-purpose, 5 medical-specialized, and 5 proprietary models) against human clinician performance, providing insights into current model capabilities.

Abstract


Endoscopic procedures are essential for diagnosing and treating internal diseases, and multi-modal large language models (MLLMs) are increasingly applied to assist in endoscopy analysis. However, current benchmarks are limited, as they typically cover specific endoscopic scenarios and a small set of clinical tasks, failing to capture the real-world diversity of endoscopic modalities and the full range of skills needed in clinical workflows. To address these issues, we introduce EndoBench, the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice with multi-dimensional capabilities. EndoBench encompasses 4 distinct endoscopic modalities, 12 specialized clinical tasks with 12 secondary subtasks, and 5 levels of visual prompting granularities, resulting in 6,832 rigorously validated VQA pairs from 21 diverse datasets. Our multi-dimensional evaluation framework mirrors the clinical workflow, spanning anatomical recognition, lesion analysis, spatial localization, and surgical operations, to holistically gauge the perceptual and diagnostic abilities of MLLMs in realistic scenarios. We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs, and establish human clinician performance as a reference standard. Our extensive experiments reveal: (1) proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts; (2) medical-domain supervised fine-tuning substantially boosts task-specific accuracy; and (3) model performance remains sensitive to prompt format and clinical task complexity. EndoBench establishes a new standard for evaluating and advancing MLLMs in endoscopy, highlighting both progress and persistent gaps between current models and expert clinical reasoning.
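To make the benchmark's structure concrete, the sketch below shows what one multiple-choice VQA record and an exact-match scoring function might look like. All field names, the example question, and the option set are illustrative assumptions, not EndoBench's actual schema.

```python
# Hypothetical sketch of an EndoBench-style multiple-choice VQA record and a
# minimal exact-match accuracy computation. Field names are assumed, not the
# benchmark's actual schema.
records = [
    {
        "image": "colonoscopy_0001.jpg",   # endoscopic frame (path assumed)
        "question": "Which anatomical landmark is shown?",
        "options": {"A": "Pylorus", "B": "Ileocecal valve",
                    "C": "Z-line", "D": "Duodenal bulb"},
        "answer": "B",                      # ground-truth option letter
        "task": "anatomical_recognition",   # one of the 12 clinical tasks
        "scenario": "colonoscopy",          # one of the 4 endoscopic modalities
    },
]

def accuracy(records, predictions):
    """Exact-match accuracy over predicted option letters."""
    correct = sum(predictions[r["image"]] == r["answer"] for r in records)
    return correct / len(records)

print(accuracy(records, {"colonoscopy_0001.jpg": "B"}))  # → 1.0
```

Exact matching on option letters is the simplest scoring rule consistent with a multiple-choice VQA setup; free-form model outputs would first need an answer-extraction step.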

Statistics

Construction Process


Data construction process of EndoBench, consisting of (a) data collection, (b) QA standardization, and (c) data filtering. Finally, we perform (d) model evaluation on EndoBench.
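Step (d) ultimately reduces to tallying correct answers per clinical task, as in the per-task tables below. A minimal sketch of that aggregation, with illustrative task names and a made-up results list rather than actual evaluation outputs:

```python
from collections import defaultdict

# Illustrative sketch (not the authors' evaluation code): grouping
# exact-match results by clinical task. Task names and outcomes are made up.
results = [
    {"task": "lesion_analysis", "correct": True},
    {"task": "lesion_analysis", "correct": False},
    {"task": "spatial_localization", "correct": True},
]

per_task = defaultdict(lambda: [0, 0])  # task -> [num_correct, num_total]
for r in results:
    per_task[r["task"]][0] += r["correct"]
    per_task[r["task"]][1] += 1

for task, (c, n) in sorted(per_task.items()):
    print(f"{task}: {c}/{n} = {c / n:.2%}")
```

The same loop keyed on a `scenario` or `visual_prompt` field would produce the per-scenario and per-prompt breakdowns reported in the other tables.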

Experiment Results

Results of different MLLMs on 12 clinical tasks.


Table 1: Results of different MLLMs on 12 clinical tasks in EndoBench. The best-performing model in each category is shown in bold, and the second best is underlined.

Results of different MLLMs on 4 different endoscopy scenarios and 4 different visual prompts.


Table 2: Results of different MLLMs on 4 different endoscopy scenarios and 4 different visual prompts in EndoBench. The best-performing model in each category is shown in bold, and the second best is underlined.

Results of different MLLMs on 12 subtasks in EndoBench.


Table 3: Results of different MLLMs on 12 subtasks in EndoBench.

Performance comparison of several leading MLLMs and Clinicians.


Figure 1: Performance comparison of several leading MLLMs and Clinicians.

Performance comparison across four major categories.


Figure 2: Performance comparison across 4 major categories in EndoBench among existing MLLMs.

Performance comparison across four endoscopic scenarios.


Figure 3: Performance comparison across 4 endoscopic scenarios in EndoBench among existing MLLMs.

Performance comparison across five different visual prompts.


Figure 4: Performance comparison across 5 different visual prompts in EndoBench among existing MLLMs.

Analysis

Observation 1: Endoscopy remains a challenging domain for MLLMs, with significant gaps between models and human expertise. Human experts achieve an average accuracy of 74.12% on endoscopy tasks, while the top-performing model, Gemini-2.5-Pro, reaches only 49.53%, a gap of roughly 25 percentage points. This highlights the inherent difficulty of endoscopy, which demands both precise visual interpretation and specialized medical knowledge. Proprietary models consistently outperform open-source models overall, yet open-source models show a surprising edge in surgical scenarios, where their accuracy improves markedly over random baselines. In contrast, on non-surgical tasks such as landmark and organ identification, open-source models perform no better than random guessing. This disparity suggests that while open-source models can leverage structured contexts, they falter on knowledge-intensive tasks, pointing to a need for stronger domain-specific capabilities.

Observation 2: Medical domain-specific supervised fine-tuning markedly boosts model performance. Medical models that underwent domain-specific supervised fine-tuning, such as MedDr and HuatuoGPT-Vision-34B, perform exceptionally well on tasks like landmark identification and organ recognition, even outperforming all proprietary models. This indicates that domain-specific training effectively equips models with essential medical knowledge, enhancing their competitiveness in specialized tasks. However, some medical models exhibit limitations in instruction-following capabilities and suffer from overfitting, which restricts their performance in broader application scenarios. This suggests that domain-specific training should pay greater attention to balancing model generalization with task adaptability.


Figure 5: The influence of visual prompts on the lesion quantification task across different MLLMs.

Observation 3: Model performance varies with visual prompt format, exposing a gap between visual perception and medical comprehension. Models' ability to understand spatial information depends heavily on how visual prompts are formatted, rather than being consistently robust across scenarios. To probe this, we test the same images across 3 tasks with different visual prompts, as shown in Figure 5. The results in Tables 1 and 2 reveal that most models, especially proprietary ones, excel at the ROI Selection task, indicating strong visual comprehension in distinguishing between regions. However, they struggle to accurately classify lesion types within those regions, suggesting that the main source of errors is a lack of medical knowledge rather than poor visual processing. In short, models can spatially differentiate areas, but their spatial understanding does not transfer broadly: it hinges on the prompt structure, with insufficient medical knowledge acting as the key limitation.
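One way such prompt-format variants can be produced is by overlaying an ROI marker directly on the frame. The sketch below is a minimal, assumed construction (the frame, coordinates, and marker style are illustrative stand-ins, not EndoBench's actual prompt pipeline):

```python
from PIL import Image, ImageDraw

# Minimal sketch of a box-style visual prompt: overlaying a rectangular ROI
# marker on a frame before it is sent to an MLLM. The black canvas stands in
# for a real endoscopic frame; coordinates and styling are illustrative.
frame = Image.new("RGB", (256, 256), "black")
draw = ImageDraw.Draw(frame)
roi = (80, 60, 180, 160)                     # (left, top, right, bottom) in pixels
draw.rectangle(roi, outline="red", width=3)  # visible ROI marker for the model
```

A coordinate-based prompt variant would instead pass the same `roi` tuple as text inside the question and leave the image unmodified, which is precisely the kind of formatting difference that Observation 3 finds models sensitive to.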

Observation 4: Polyp counting exposes dual challenges in lesion segmentation and numerical reasoning. Polyp counting, a task that requires both spatial localization of lesions and numerical reasoning, remains challenging: all models achieve accuracy below 30%. To further analyze the sources of model errors, we introduced a new visual prompt format (Figure 6), which led to modest improvements in accuracy across models. Notably, Gemini-2.5-Pro achieved a remarkable 92% accuracy under this new prompting strategy. This significant improvement suggests that Gemini possesses strong spatial recognition and counting capabilities, indicating that the primary limitation across models lies not in computational or spatial reasoning but in lesion identification. This finding underscores the critical need to better integrate domain-specific medical knowledge into vision-language models for tasks that combine visual analysis with clinical understanding.


Figure 6: The influence of visual prompts on the lesion quantification task across different MLLMs.

Case Study

In this section, we present a case study analyzing the performance of multiple MLLMs on EndoBench across various endoscopic scenarios. In addition to showcasing correct responses, we categorize errors into four distinct types: Perceptual Errors, Lack of Knowledge, Irrelevant Response, and Refusal to Answer. The following figures illustrate these case studies: correct samples are presented in Figures 7 through 12, while error samples are shown in Figures 13 through 20.

Correct Samples (Figures 7–12): These figures highlight exemplary performances by leading models such as Gemini-2.5-Pro and GPT-4o. These models demonstrate robust capabilities in accurately interpreting endoscopic images and providing clinically relevant responses, underscoring their potential for assisting in real-world endoscopic analysis.

Error Analysis: Errors observed in the case studies are classified into four categories, each revealing specific limitations in MLLM performance:

  • Perceptual Errors (Figures 13 and 14): MLLMs may struggle to accurately perceive or interpret visual information in images, including failing to detect critical objects, misidentifying elements, or overlooking essential details. In Fig. 13, QvQ-72B fails to recognize erythematous areas and focuses on irrelevant yellow-white granules. Similarly, in Fig. 14, HuatuoGPT-Vision-34B overlooks that the mucosa has been stained blue, leading to an incorrect interpretation of the scene. These errors indicate a limitation in the models' ability to accurately recognize clinically significant visual patterns.
  • Lack of Knowledge (Figures 15 and 16): MLLMs may accurately identify visual elements in an image and comprehend the question but still provide incorrect answers due to insufficient medical domain expertise. This manifests as misinterpretations of clinical signs or failure to differentiate between similar medical conditions. For instance, in Fig. 15, QvQ-72B correctly identifies low-level visual features, such as red points in the image, but misinterprets them as blood vessels. Similarly, in Fig. 16, HuatuoGPT-Vision-34B notices prominent bright red areas in the image during reasoning but fails to interpret them as bleeding, leading to an inaccurate response. These errors highlight a deficiency in domain-specific medical knowledge, where the model fails to contextualize visual cues with appropriate clinical understanding.
  • Irrelevant Response (Figures 17 and 18): MLLMs sometimes generate responses that are unrelated to the user's query, producing irrelevant, incomplete, or incomprehensible information that fails to address the question. For example, in Fig. 17, LLaVA-Med is asked to determine the number of surgical instruments in an endoscopic image but outputs a tautological restatement of the query, lacking any clinical insight. In another case, Fig. 18, ColonGPT is tasked with classifying pathological findings in an endoscopic image but outputs a term unrelated to both the provided options and the observed pathology.
  • Refusal to Answer (Figures 19 and 20): Certain MLLMs, particularly proprietary ones, are designed to decline questions involving sensitive information, ethical dilemmas, or professional medical advice in order to ensure safety and compliance. For example, in Fig. 19, GPT-4o is asked to identify the coordinates of a low-grade adenoma in an endoscopic image but states it is unable to provide the coordinates. Likewise, in Fig. 20, Grok-3 is tasked with counting surgical instruments in an endoscopic image but explicitly refuses, citing its inability to process such requests. These cases highlight the need for enhanced technical capabilities and clearer ethical guidelines to balance safety with clinical utility in MLLM responses.

These case studies emphasize the need for improved medical knowledge integration and enhanced perceptual capabilities to bridge the gap between current MLLM performance and clinical requirements.

Figure 7: Correct sample from Gemini-2.5-Pro.
Figure 8: Correct sample from Gemini-2.5-Pro.
Figure 9: Correct sample from Gemini-2.5-Pro.
Figure 10: Correct sample from GPT-4o.
Figure 11: Correct sample from GPT-4o.
Figure 12: Correct sample from GPT-4o.
Figure 13: Error sample demonstrating Perceptual Errors (QvQ-72B).
Figure 14: Error sample demonstrating Perceptual Errors (HuatuoGPT-Vision-34B).
Figure 15: Error sample demonstrating Lack of Knowledge (QvQ-72B).
Figure 16: Error sample demonstrating Lack of Knowledge (HuatuoGPT-Vision-34B).
Figure 17: Error sample demonstrating Irrelevant Response (LLaVA-Med).
Figure 18: Error sample demonstrating Irrelevant Response (ColonGPT).
Figure 19: Error sample demonstrating Refusal to Answer (GPT-4o).
Figure 20: Error sample demonstrating Refusal to Answer (Grok-3).