Juntao Jiang 1★ · Jiangning Zhang 1★ · Yali Bi 2 · Jinsheng Bai 1 · Weixuan Liu 3 · Weiwei Jin 4 · Zhucun Xue 1 · Yong Liu 1† · Xiaobin Hu 5 · Shuicheng Yan 5
1Zhejiang University
2University of Science and Technology of China
3East China Normal University
4Zhejiang Provincial People’s Hospital
5National University of Singapore
This repository is a comprehensive collection of resources for M3CoTBench. If you find any work missing or have any suggestions, feel free to open a pull request or contact us. We will promptly add the missing papers to this repository.
Compared with existing multimodal medical benchmarks, our proposed M3CoTBench offers the following key advantages:
- Diverse Medical VQA Dataset.
We curate a 1,079-image medical visual question answering (VQA) dataset spanning 24 imaging modalities, stratified by difficulty and annotated with step-by-step reasoning aligned with real clinical diagnostic workflows.
- Multidimensional CoT-Centric Evaluation Metrics.
We propose a comprehensive evaluation protocol that measures reasoning correctness, efficiency, impact, and consistency, enabling fine-grained and interpretable analysis of CoT behaviors across diverse MLLMs.
- Comprehensive Model Analysis and Case Studies.
We benchmark both general-purpose and medical-domain MLLMs using quantitative metrics and in-depth qualitative case studies, revealing strengths and failure modes in clinical reasoning to guide future model design.
🤓 You can view the scores and comparisons of each method at M3CoTBench LeaderBoard.
- Introduction
- Highlight
- Data Pipeline
- Benchmark Overview
- Installation
- Usage
- Experiments
- Acknowledgments
- Citation
- Contact
Data acquisition and annotation pipeline of M3CoTBench. a) Carefully curated medical images from various public sources. b) Multi-type and multi-difficulty QA generation via LLMs and expert calibration. c) Structured annotation of key reasoning steps aligned with clinical diagnostic workflows.
Overview of M3CoTBench. Top: The benchmark covers 24 imaging modalities/examination types, 4 question types, and 13 clinical reasoning tasks. Middle: CoT annotation examples and 4 evaluation dimensions. Bottom: The distribution of image-QA pairs across a) modalities, b) question types, and c) tasks.
git clone https://github.com/juntaoJianggavin/M3CoTBench.git
This section provides access to the M3CoTBench Database, which contains the complete .png image data of M3CoTBench and a .xlsx file providing the questions, answers, and annotated CoT steps.
🥰You can download M3CoTBench Database to your local path using the following command:
huggingface-cli download --repo-type dataset --resume-download APRIL-AIGC/M3CoTBench --local-dir $YOUR_LOCAL_PATH
Then put M3CoTBench.xlsx and images/ into M3CoTBench/inference/dataset/
If you want to run your own model, you can follow the procedure below, though you are also free to run the experiments however you prefer. One approach uses a CoT prompt; the other outputs answers directly. Either set of results can then be passed straight to the evaluation section.
Enter the directory:
cd M3CoTBench/inference/
Note: Model weights should be placed in M3CoTBench/inference/pretrain. Each medical model requires its own specific Conda environment.
For HealthGPT:
conda env create -f ../environment/healthgpt_environment.yaml
conda activate M3CoTBench_healthgpt
bash models/medical_models/HealthGPT/llava/demo/run_batch_eval.sh
For HuatuoGPT-Vision:
conda env create -f ../environment/huatuo_environment.yaml
conda activate M3CoTBench_huatuo
# Direct Inference
python models/medical_models/HuatuoGPT-Vision/eval.py --run_direct
# CoT Inference
python models/medical_models/HuatuoGPT-Vision/eval.py --run_cot
For LLaVA-Med:
conda env create -f ../environment/llavamed_environment.yaml
conda activate M3CoTBench_llavamed
# Direct Inference
python models/medical_models/LLaVA-Med/llava/eval/model_vqa.py --mode direct
# CoT Inference
python models/medical_models/LLaVA-Med/llava/eval/model_vqa.py --mode cot
Note: Lingshu and MedGemma are integrated into the General Framework below.
Environment: M3CoTBench
conda env create -f ../environment/environment.yaml
conda activate M3CoTBench_env
(1) API Inference
# Start "GPT-5" on port xxxxx with 4 internal processes
bash scripts/run_api_model.sh "GPT-5" xxxxx 4
# Start "Claude-Sonnet-4.5" on port xxxxx (default 4 processes)
bash scripts/run_api_model.sh "Claude-Sonnet-4.5" xxxxx
(2) Local Inference
bash scripts/run_local_gpu_model.sh LLaVA-CoT 1,2,3,4,5,6 all xxxxx
To rerun failed inference data and update results:
cd M3CoTBench/inference/
# 1. Rerun failed files and merge into the original JSON
python reprocess_failed.py \
--input-file final_output/Lingshu-32B/Lingshu-32B_direct.json \
--model "Lingshu-32B" \
--data-path "dataset/M3CoTBench.xlsx" \
--image-dir "dataset/images" \
--update-in-place
# 2. Recalculate timing summary
python recalculate_summary.py \
--results-file final_output/Lingshu-32B/Lingshu-32B_direct.json \
--summary-file final_output/Lingshu-32B/Lingshu-32B_summary.json
Step 1: Merge Chain-of-Thought Fields.
Merge the CoT steps of the correct answers and convert the format to XLSX.
cd M3CoTBench/evaluation/
python combine_fields.py
Step 2: Reformat results.
Batch-format the inference JSON files into the evaluation output format (XLSX). Each output file will contain both the CoT of the correct answer and the predicted answer from the inference.
python tools/update_lmmseval_json.py
Step 3. Run Evaluation Scripts.
You can run metrics individually. For example, to evaluate recall:
bash scripts/recall.sh
bash scripts/precision.sh
Note: Simply update the data path for YOUR_MODEL_NAME inside recall.sh (or other script files).
After the GPT evaluation, you should see a cache/ directory structured as follows:
📂 cache
┣━━ 📂 recall
┃ ┗━━ 📂 YOUR_MODEL_NAME
┃ ┣━━ 📄 1.json
┃ ┣━━ 📄 2.json
┃ ┗━━ 📄 ...
┗━━ 📂 precision
┗━━ 📂 YOUR_MODEL_NAME
Step 4. Calculate Metrics for P, R and F1.
We cache the evaluation results for all questions in the cache directory. Here, we read results from the cache to calculate the final metrics.
For example, to run correctness.py:
python final_score/correctness.py --cache_dir cache --save_path final_results
The script will automatically calculate Recall and Precision, and then compute the F1 Score or Average Score.
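The F1 score here is the standard harmonic mean of Precision and Recall. A minimal sketch (the function name is illustrative; the example values are taken from the LLaVA-CoT row of the results table below):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# LLaVA-CoT row of the results table: P=54.08, R=46.15
print(round(f1_score(54.08, 46.15), 2))  # 49.8
```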
Alternatively, you can calculate each metric individually. For example, to calculate Recall:
python final_score/recall.py --cache_dir cache/recall --save_path final_results
Then you can see a directory structured as follows:
📂final_results/
├─ 📂recall/
│ ├─ 📄recall_results.json
│ └─ 📄recall_errors.json
├─ 📂precision/
│ ├─ 📄precision_results.json
│ └─ 📄precision_errors.json
└─ 📂quality/
└─ 📄quality_results.json
The P, R, F1 scores are stored in "quality_results.json".
Step 5. Calculate Accuracies for the Answers.
Evaluate the direct answer:
python scripts/accuracy.py \
--json_path "../inference/final_output/Qwen3-VL-30B-Thinking/Qwen3-VL-30B-Thinking_direct.json" \
--excel_path "../inference/dataset/M3CoTBench.xlsx" \
--output_path "Qwen3-VL-30B-Thinking_direct.json" \
--model "gpt-4o" \
--api_key "sk-your-api-key-here" \
--base_url "sk-your-api-url-here"
Evaluate the CoT answer:
python scripts/accuracy.py \
--json_path "../inference/final_output/Qwen3-VL-30B-Thinking/Qwen3-VL-30B-Thinking_cot.json" \
--excel_path "../inference/dataset/M3CoTBench.xlsx" \
--output_path "Qwen3-VL-30B-Thinking_cot.json" \
--model "gpt-4o" \
--api_key "sk-your-api-key-here" \
--base_url "sk-your-api-url-here"
Then the impact score can be calculated.
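Judging from the main-results table, the impact score I appears to equal the step (CoT) accuracy minus the direct accuracy; this is an inference from the reported numbers, not an official definition. A minimal sketch:

```python
def impact_score(acc_direct: float, acc_step: float) -> float:
    """CoT impact (assumed): step accuracy minus direct accuracy, in percentage points."""
    return round(acc_step - acc_direct, 2)

# LLaVA-CoT row: Acc_dir=40.08, Acc_step=36.75 -> I = -3.33
print(impact_score(40.08, 36.75))  # -3.33
```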
Step 6. Calculate the Efficiency Metrics for the Output Steps.
The durations for direct and CoT inferences are in the summary output file (e.g. Claude-Sonnet-4.5_summary.json).
{
"cot": {
"total_item_count": 1079,
"successful_item_count": 1079,
"failed_item_count": 0,
"total_successful_time_s": 13598.991699999999,
"total_wasted_time_s": 75.924144786
},
"direct": {
"total_item_count": 1079,
"successful_item_count": 1079,
"failed_item_count": 0,
"total_successful_time_s": 5051.896599999994,
"total_wasted_time_s": 63.773742768000005
}
}
Then the Latency score can be obtained by: total_successful_time_s_cot / total_successful_time_s_direct = 13598.9917 / 5051.8966 ≈ 2.69.
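A minimal sketch of the latency computation, using the numbers from the summary shown above (the inline dict stands in for `json.load` on the real summary file):

```python
import json  # in practice: summary = json.load(open(".../Claude-Sonnet-4.5_summary.json"))

# Values copied from the summary file shown above.
summary = {
    "cot": {"total_successful_time_s": 13598.9917},
    "direct": {"total_successful_time_s": 5051.8966},
}
latency = summary["cot"]["total_successful_time_s"] / summary["direct"]["total_successful_time_s"]
print(round(latency, 2))  # 2.69
```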
The total number of matched steps can be found in the recall output file "recall_results.json" in "final_results/recall/".
"Claude-Sonnet-4.5": {
"overall_metrics": {
"average_recall": 0.5971,
"total_matched_steps": 2100
},
...
Then the Efficiency Score can be obtained by: total_matched_steps / total_successful_time_s_cot = 2100 / 13598.9917 ≈ 0.1544.
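The same computation as a short sketch, with the two inputs taken from "recall_results.json" and the timing summary above:

```python
# Values from recall_results.json and the timing summary above.
total_matched_steps = 2100
total_successful_time_s_cot = 13598.9917

efficiency = total_matched_steps / total_successful_time_s_cot
print(round(efficiency, 4))  # 0.1544
```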
Step 7. Calculate the Consistency Score for the Output Steps.
We use GPT-4o and Gemini-2.5-Pro for evaluation and average the results.
python scripts/batch_processor.py \
--result_dir "../inference/final_output/" \
--output_dir "./final_consistency_results_gpt4o" \
--question_file "../inference/dataset/M3CoTBench.xlsx" \
--api-base-url "sk-your-api-url-here" \
--api_key "sk-your-api-key-here" \
--model_name gpt-4o
python scripts/lcs_analyzer.py \
--type_file "../inference/dataset/type.xlsx" \
--csv_dir "./final_consistency_results_gpt4o" \
--output_dir "./consistency_score_gpt4o"
python scripts/batch_processor.py \
--result_dir "../inference/final_output/" \
--output_dir "./final_consistency_results" \
--question_file "../inference/dataset/M3CoTBench.xlsx" \
--api-base-url "sk-your-api-url-here" \
--api_key "sk-your-api-key-here" \
--model_name gemini-2.5-pro
python scripts/lcs_analyzer.py \
--type_file "../inference/dataset/type.xlsx" \
--csv_dir "./final_consistency_results_gemini" \
--output_dir "./consistency_score_gemini"
Then the consistency scores will be in "./consistency_score_gpt4o/lcs_results/lcs_analyzer_zh_summary.csv" and "./consistency_score_gemini/lcs_results/lcs_analyzer_zh_summary.csv". The column named "overall_average_similarity" contains the consistency score. You only need to average the two results.
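A minimal sketch of the final averaging step, assuming each summary CSV exposes the "overall_average_similarity" column as described above (the toy inline CSVs stand in for the real files; the values are illustrative only):

```python
import csv
import io

def read_overall_similarity(csv_text: str) -> float:
    """Read 'overall_average_similarity' from the first row of a summary CSV."""
    row = next(csv.DictReader(io.StringIO(csv_text)))
    return float(row["overall_average_similarity"])

# Toy CSVs with the expected column name (values are illustrative only):
gpt4o_csv = "overall_average_similarity\n85.22\n"
gemini_csv = "overall_average_similarity\n84.10\n"
consistency = (read_overall_similarity(gpt4o_csv) + read_overall_similarity(gemini_csv)) / 2
print(round(consistency, 2))  # 84.66
```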
Metrics: ↑ Higher is Better, ↓ Lower is Better. Bold: Best result. Correctness covers F1, P, and R; Impact covers Accdir, Accstep, and I; Efficiency covers E and L; Consistency is Cpath.

| # | Model | Category | F1 (↑) | P (↑) | R (↑) | Accdir (↑) | Accstep (↑) | I (↑) | E (↑) | L (↓) | Cpath (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LLaVA-CoT | Open-source | 49.80 | 54.08 | 46.15 | 40.08 | 36.75 | -3.33 | 0.06 | 1.56 | 77.02 |
| 2 | InternVL3.5-8B | Open-source | 56.48 | 60.61 | 52.88 | 56.81 | 53.61 | -3.20 | 0.10 | 18.27 | 71.65 |
| 3 | InternVL3.5-30B | Open-source | 59.42 | 62.15 | 56.92 | 63.81 | 57.60 | -6.21 | 0.03 | 16.68 | 76.30 |
| 4 | Qwen3-VL-Instruct-8B | Open-source | 55.17 | 52.74 | 57.84 | 51.30 | 46.62 | -4.68 | 0.04 | 93.94 | 82.65 |
| 5 | Qwen3-VL-Instruct-30B | Open-source | 59.15 | 56.13 | 62.51 | 54.63 | 51.39 | -3.24 | 0.03 | 35.63 | 83.01 |
| 6 | Qwen3-VL-Thinking-8B | Open-source | 59.87 | 59.84 | 59.91 | 48.33 | 52.83 | +4.50 | 0.02 | 2.79 | 76.91 |
| 7 | Qwen3-VL-Thinking-30B | Open-source | 62.15 | 63.34 | 61.01 | 51.90 | 55.47 | +3.57 | 0.02 | 1.15 | 76.02 |
| 8 | GPT-4.1 | Closed-source | 60.76 | 58.32 | 63.42 | 56.77 | 57.97 | +1.22 | 0.17 | 5.08 | 81.31 |
| 9 | GPT-5 | Closed-source | 55.13 | 64.15 | 48.34 | 58.76 | 58.29 | -0.47 | 0.06 | 1.10 | 65.39 |
| 10 | Gemini 2.5 Pro | Closed-source | 66.07 | 62.48 | 70.10 | 60.24 | 60.06 | -0.18 | 0.10 | 1.52 | 82.00 |
| 11 | Claude-Sonnet-4.5 | Closed-source | 56.50 | 53.62 | 59.71 | 51.25 | 51.07 | -0.18 | 0.15 | 2.69 | 85.22 |
| 12 | LLaVA-Med (7B) | Medical | 30.51 | 36.33 | 26.30 | 29.38 | 29.29 | -0.09 | 0.35 | 3.22 | 72.68 |
| 13 | HuatuoGPT-Vision (7B) | Medical | 49.45 | 51.17 | 47.85 | 41.89 | 34.94 | -6.95 | 0.21 | 5.92 | 73.19 |
| 14 | HealthGPT (3.8B) | Medical | 32.56 | 47.27 | 24.83 | 44.11 | 41.98 | -2.13 | 0.06 | 15.36 | 67.72 |
| 15 | Lingshu-7B | Medical | 57.57 | 63.96 | 52.34 | 50.00 | 42.08 | -7.92 | 0.30 | 8.37 | 74.83 |
| 16 | Lingshu-32B | Medical | 59.16 | 65.68 | 53.82 | 51.77 | 44.95 | -6.82 | 0.21 | 10.87 | 71.47 |
| 17 | MedGemma-4B | Medical | 48.13 | 50.29 | 46.14 | 43.33 | 41.29 | -2.04 | 0.05 | 20.61 | 74.03 |
| 18 | MedGemma-27B | Medical | 50.98 | 48.33 | 53.81 | 46.06 | 45.88 | -0.18 | 0.03 | 23.71 | 82.55 |
We would like to acknowledge that some parts of the code were inspired by and referenced from MME-CoT.
If you find M3CoTBench useful for your research, please consider giving a star⭐ and citation📝 :)
@misc{jiang2026m3cotbenchbenchmarkchainofthoughtmllms,
title={M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding},
author={Juntao Jiang and Jiangning Zhang and Yali Bi and Jinsheng Bai and Weixuan Liu and Weiwei Jin and Zhucun Xue and Yong Liu and Xiaobin Hu and Shuicheng Yan},
year={2026},
eprint={2601.08758},
archivePrefix={arXiv},
primaryClass={eess.IV},
url={https://arxiv.org/abs/2601.08758},
}

