Juntao Jiang 1★ · Jiangning Zhang 1★ · Yali Bi 2 · Jinsheng Bai 1 · Weixuan Liu 3 · Weiwei Jin 4 · Zhucun Xue 1 · Yong Liu 1† · Xiaobin Hu 5 · Shuicheng Yan 5
1Zhejiang University
2University of Science and Technology of China
3East China Normal University
4Zhejiang Provincial People’s Hospital
5National University of Singapore
This repository is a comprehensive collection of resources for M3CoTBench. If you find any work missing or have any suggestions, feel free to open a pull request or contact us. We will promptly add the missing papers to this repository.
Compared with existing multimodal medical benchmarks, our proposed M3CoTBench offers the following key advantages:
- Diverse Medical VQA Dataset.
We curate a 1,079-image medical visual question answering (VQA) dataset spanning 24 imaging modalities, stratified by difficulty and annotated with step-by-step reasoning aligned with real clinical diagnostic workflows.
- Multidimensional CoT-Centric Evaluation Metrics.
We propose a comprehensive evaluation protocol that measures reasoning correctness, efficiency, impact, and consistency, enabling fine-grained and interpretable analysis of CoT behaviors across diverse MLLMs.
- Comprehensive Model Analysis and Case Studies.
We benchmark both general-purpose and medical-domain MLLMs using quantitative metrics and in-depth qualitative case studies, revealing strengths and failure modes in clinical reasoning to guide future model design.
🤓 You can view the scores and comparisons of each method at M3CoTBench LeaderBoard.
- Introduction
- Highlight
- Data Pipeline
- Benchmark Overview
- Installation
- Usage
- Experiments
- Acknowledgments
- Citation
- Contact
Data acquisition and annotation pipeline of M3CoTBench. a) Carefully curated medical images from various public sources. b) Multi-type and multi-difficulty QA generation via LLMs and expert calibration. c) Structured annotation of key reasoning steps aligned with clinical diagnostic workflows.
Overview of M3CoTBench. Top: The benchmark covers 24 imaging modalities/examination types, 4 question types, and 13 clinical reasoning tasks. Middle: CoT annotation examples and 4 evaluation dimensions. Bottom: The distribution of image-QA pairs across a) modalities, b) question types, and c) tasks.
git clone https://github.com/juntaoJianggavin/M3CoTBench.git
This section provides access to the M3CoTBench Database, which contains the complete .png image data of M3CoTBench and a .xlsx file providing the questions, answers, and annotated CoT steps.
🥰You can download M3CoTBench Database to your local path using the following command:
huggingface-cli download --repo-type dataset --resume-download APRIL-AIGC/M3CoTBench --local-dir $YOUR_LOCAL_PATH
Then put M3CoTBench.xlsx and images/ into M3CoTBench/inference/dataset/
If you want to run your own model, you can follow the procedure below, though you are also free to run the experiments however you prefer. One approach uses a CoT prompt; the other outputs answers directly. Either set of results can then be passed straight to the evaluation section.
Enter the directory:
cd M3CoTBench/inference/
Note: Model weights should be placed in M3CoTBench/inference/pretrain. Each medical model requires its own specific Conda environment.
For HealthGPT:
conda env create -f ../environment/healthgpt_environment.yaml
conda activate M3CoTBench_healthgpt
bash models/medical_models/HealthGPT/llava/demo/run_batch_eval.sh
For HuatuoGPT-Vision:
conda env create -f ../environment/huatuo_environment.yaml
conda activate M3CoTBench_huatuo
# Direct Inference
python models/medical_models/HuatuoGPT-Vision/eval.py --run_direct
# CoT Inference
python models/medical_models/HuatuoGPT-Vision/eval.py --run_cot
For LLaVA-Med:
conda env create -f ../environment/llavamed_environment.yaml
conda activate M3CoTBench_llavamed
# Direct Inference
python models/medical_models/LLaVA-Med/llava/eval/model_vqa.py --mode direct
# CoT Inference
python models/medical_models/LLaVA-Med/llava/eval/model_vqa.py --mode cot
Note: Lingshu and MedGemma are integrated into the General Framework below.
Environment: M3CoTBench
conda env create -f ../environment/environment.yaml
conda activate M3CoTBench_env
(1) API Inference
# Start "GPT-5" on port xxxxx with 4 internal processes
bash scripts/run_api_model.sh "GPT-5" xxxxx 4
# Start "Claude-Sonnet-4.5" on port xxxxx (default 4 processes)
bash scripts/run_api_model.sh "Claude-Sonnet-4.5" xxxxx
(2) Local Inference
bash scripts/run_local_gpu_model.sh LLaVA-CoT 1,2,3,4,5,6 all xxxxx
To rerun failed inference data and update results:
cd M3CoTBench/inference/
# 1. Rerun failed files and merge into the original JSON
python reprocess_failed.py \
--input-file final_output/Lingshu-32B/Lingshu-32B_direct.json \
--model "Lingshu-32B" \
--data-path "dataset/M3CoTBench.xlsx" \
--image-dir "dataset/images" \
--update-in-place
# 2. Recalculate timing summary
python recalculate_summary.py \
--results-file final_output/Lingshu-32B/Lingshu-32B_direct.json \
--summary-file final_output/Lingshu-32B/Lingshu-32B_summary.json
Step 1: Merge Chain-of-Thought Fields.
Merge the CoT steps of the correct answers and convert the format to XLSX.
cd M3CoTBench/evaluation/
python combine_fields.py
Step 2: Reformat results.
Batch-format the inference JSON files into the evaluation output format (XLSX). Each output file will contain both the CoT of the correct answer and the predicted answer from the inference.
python tools/update_lmmseval_json.py
Step 3. Run Evaluation Scripts.
You can run metrics individually. For example, to evaluate recall:
bash scripts/recall.sh
bash scripts/precision.sh
Note: Simply update the data path for YOUR_MODEL_NAME inside recall.sh (or other script files).
After the GPT evaluation, you should see a cache/ directory structured as follows:
📂 cache
┣━━ 📂 recall
┃ ┗━━ 📂 YOUR_MODEL_NAME
┃ ┣━━ 📄 1.json
┃ ┣━━ 📄 2.json
┃ ┗━━ 📄 ...
┗━━ 📂 precision
┗━━ 📂 YOUR_MODEL_NAME
Step 4. Calculate Metrics for P, R and F1.
We cache the evaluation results for all questions in the cache directory. Here, we read results from the cache to calculate the final metrics.
For example, to run correctness.py:
python final_score/correctness.py --cache_dir cache --save_path final_results
The script will automatically calculate Recall and Precision, and then compute the F1 Score or Average Score.
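The F1 score here is the standard harmonic mean of Precision and Recall. A minimal sketch (the function name is illustrative; the example values are taken from the LLaVA-CoT row of the results table below):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# LLaVA-CoT row of the results table: P=54.08, R=46.15
print(round(f1_score(54.08, 46.15), 2))  # 49.8
```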
Alternatively, you can calculate each metric individually. For example, to calculate Recall:
python final_score/recall.py --cache_dir cache/recall --save_path final_results
Then you can see a directory structured as follows:
📂final_results/
├─ 📂recall/
│ ├─ 📄recall_results.json
│ └─ 📄recall_errors.json
├─ 📂precision/
│ ├─ 📄precision_results.json
│ └─ 📄precision_errors.json
└─ 📂quality/
└─ 📄quality_results.json
The P, R, F1 scores are stored in "quality_results.json".
Step 5. Calculate Accuracies for the Answers.
Evaluate the direct answer:
python scripts/accuracy.py \
--json_path "../inference/final_output/Qwen3-VL-30B-Thinking/Qwen3-VL-30B-Thinking_direct.json" \
--excel_path "../inference/dataset/M3CoTBench.xlsx" \
--output_path "Qwen3-VL-30B-Thinking_direct.json" \
--model "gpt-4o" \
--api_key "sk-your-api-key-here" \
--base_url "sk-your-api-url-here"
Evaluate the CoT answer:
python scripts/accuracy.py \
--json_path "../inference/final_output/Qwen3-VL-30B-Thinking/Qwen3-VL-30B-Thinking_cot.json" \
--excel_path "../inference/dataset/M3CoTBench.xlsx" \
--output_path "Qwen3-VL-30B-Thinking_cot.json" \
--model "gpt-4o" \
--api_key "sk-your-api-key-here" \
--base_url "sk-your-api-url-here"
Then the impact score can be calculated.
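Judging from the main-results table, the impact score I appears to equal the step (CoT) accuracy minus the direct accuracy; this is an inference from the reported numbers, not an official definition. A minimal sketch:

```python
def impact_score(acc_direct: float, acc_step: float) -> float:
    """CoT impact (assumed): step accuracy minus direct accuracy, in percentage points."""
    return round(acc_step - acc_direct, 2)

# LLaVA-CoT row: Acc_dir=40.08, Acc_step=36.75 -> I = -3.33
print(impact_score(40.08, 36.75))  # -3.33
```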
Step 6. Calculate the Efficiency Metrics for the Output Steps.
The durations for direct and CoT inferences are in the summary output file (e.g. Claude-Sonnet-4.5_summary.json).
{
"cot": {
"total_item_count": 1079,
"successful_item_count": 1079,
"failed_item_count": 0,
"total_successful_time_s": 13598.991699999999,
"total_wasted_time_s": 75.924144786
},
"direct": {
"total_item_count": 1079,
"successful_item_count": 1079,
"failed_item_count": 0,
"total_successful_time_s": 5051.896599999994,
"total_wasted_time_s": 63.773742768000005
}
}
Then the Latency score can be obtained by: total_successful_time_s_cot / total_successful_time_s_direct = 13598.9917 / 5051.8966 ≈ 2.69.
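A minimal sketch of the latency computation, using the numbers from the summary shown above (the inline dict stands in for `json.load` on the real summary file):

```python
import json  # in practice: summary = json.load(open(".../Claude-Sonnet-4.5_summary.json"))

# Values copied from the summary file shown above.
summary = {
    "cot": {"total_successful_time_s": 13598.9917},
    "direct": {"total_successful_time_s": 5051.8966},
}
latency = summary["cot"]["total_successful_time_s"] / summary["direct"]["total_successful_time_s"]
print(round(latency, 2))  # 2.69
```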
The total number of matched steps can be found in the recall output file "recall_results.json" in "final_results/recall/".
"Claude-Sonnet-4.5": {
"overall_metrics": {
"average_recall": 0.5971,
"total_matched_steps": 2100
},
...
Then the Efficiency Score can be obtained by: total_matched_steps / total_successful_time_s_cot = 2100 / 13598.9917 ≈ 0.1544.
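The same computation as a short sketch, with the two inputs taken from "recall_results.json" and the timing summary above:

```python
# Values from recall_results.json and the timing summary above.
total_matched_steps = 2100
total_successful_time_s_cot = 13598.9917

efficiency = total_matched_steps / total_successful_time_s_cot
print(round(efficiency, 4))  # 0.1544
```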
Step 7. Calculate the Consistency Score for the Output Steps.
We use GPT-4o and Gemini-2.5-Pro for evaluation and average the results.
python scripts/batch_processor.py \
--result_dir "../inference/final_output/" \
--output_dir "./final_consistency_results_gpt4o" \
--question_file "../inference/dataset/M3CoTBench.xlsx" \
--api-base-url "sk-your-api-url-here" \
--api_key "sk-your-api-key-here" \
--model_name gpt-4o
python scripts/lcs_analyzer.py \
--type_file "../inference/dataset/type.xlsx" \
--csv_dir "./final_consistency_results_gpt4o" \
--output_dir "./consistency_score_gpt4o"
python scripts/batch_processor.py \
--result_dir "../inference/final_output/" \
--output_dir "./final_consistency_results" \
--question_file "../inference/dataset/M3CoTBench.xlsx" \
--api-base-url "sk-your-api-url-here" \
--api_key "sk-your-api-key-here" \
--model_name gemini-2.5-pro
python scripts/lcs_analyzer.py \
--type_file "../inference/dataset/type.xlsx" \
--csv_dir "./final_consistency_results_gemini" \
--output_dir "./consistency_score_gemini"
Then the consistency scores will be in "./consistency_score_gpt4o/lcs_results/lcs_analyzer_zh_summary.csv" and "./consistency_score_gemini/lcs_results/lcs_analyzer_zh_summary.csv". The column named "overall_average_similarity" contains the consistency score. You only need to average the two results.
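A minimal sketch of the final averaging step, assuming each summary CSV exposes the "overall_average_similarity" column as described above (the toy inline CSVs stand in for the real files; the values are illustrative only):

```python
import csv
import io

def read_overall_similarity(csv_text: str) -> float:
    """Read 'overall_average_similarity' from the first row of a summary CSV."""
    row = next(csv.DictReader(io.StringIO(csv_text)))
    return float(row["overall_average_similarity"])

# Toy CSVs with the expected column name (values are illustrative only):
gpt4o_csv = "overall_average_similarity\n85.22\n"
gemini_csv = "overall_average_similarity\n84.10\n"
consistency = (read_overall_similarity(gpt4o_csv) + read_overall_similarity(gemini_csv)) / 2
print(round(consistency, 2))  # 84.66
```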
Metrics: ↑ Higher is Better, ↓ Lower is Better. Bold: Best result. Correctness covers F1, P, and R; Impact covers Accdir, Accstep, and I; Efficiency covers E and L; Consistency is Cpath.

| # | Model | Category | F1 (↑) | P (↑) | R (↑) | Accdir (↑) | Accstep (↑) | I (↑) | E (↑) | L (↓) | Cpath (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LLaVA-CoT | Open-source | 49.80 | 54.08 | 46.15 | 40.08 | 36.75 | -3.33 | 0.06 | 1.56 | 77.02 |
| 2 | InternVL3.5-8B | Open-source | 56.48 | 60.61 | 52.88 | 56.81 | 53.61 | -3.20 | 0.10 | 18.27 | 71.65 |
| 3 | InternVL3.5-30B | Open-source | 59.42 | 62.15 | 56.92 | 63.81 | 57.60 | -6.21 | 0.03 | 16.68 | 76.30 |
| 4 | Qwen3-VL-Instruct-8B | Open-source | 55.17 | 52.74 | 57.84 | 51.30 | 46.62 | -4.68 | 0.04 | 93.94 | 82.65 |
| 5 | Qwen3-VL-Instruct-30B | Open-source | 59.15 | 56.13 | 62.51 | 54.63 | 51.39 | -3.24 | 0.03 | 35.63 | 83.01 |
| 6 | Qwen3-VL-Thinking-8B | Open-source | 59.87 | 59.84 | 59.91 | 48.33 | 52.83 | +4.50 | 0.02 | 2.79 | 76.91 |
| 7 | Qwen3-VL-Thinking-30B | Open-source | 62.15 | 63.34 | 61.01 | 51.90 | 55.47 | +3.57 | 0.02 | 1.15 | 76.02 |
| 8 | GPT-4.1 | Closed-source | 60.76 | 58.32 | 63.42 | 56.77 | 57.97 | +1.22 | 0.17 | 5.08 | 81.31 |
| 9 | GPT-5 | Closed-source | 55.13 | 64.15 | 48.34 | 58.76 | 58.29 | -0.47 | 0.06 | 1.10 | 65.39 |
| 10 | Gemini 2.5 Pro | Closed-source | 66.07 | 62.48 | 70.10 | 60.24 | 60.06 | -0.18 | 0.10 | 1.52 | 82.00 |
| 11 | Claude-Sonnet-4.5 | Closed-source | 56.50 | 53.62 | 59.71 | 51.25 | 51.07 | -0.18 | 0.15 | 2.69 | 85.22 |
| 12 | LLaVA-Med (7B) | Medical | 30.51 | 36.33 | 26.30 | 29.38 | 29.29 | -0.09 | 0.35 | 3.22 | 72.68 |
| 13 | HuatuoGPT-Vision (7B) | Medical | 49.45 | 51.17 | 47.85 | 41.89 | 34.94 | -6.95 | 0.21 | 5.92 | 73.19 |
| 14 | HealthGPT (3.8B) | Medical | 32.56 | 47.27 | 24.83 | 44.11 | 41.98 | -2.13 | 0.06 | 15.36 | 67.72 |
| 15 | Lingshu-7B | Medical | 57.57 | 63.96 | 52.34 | 50.00 | 42.08 | -7.92 | 0.30 | 8.37 | 74.83 |
| 16 | Lingshu-32B | Medical | 59.16 | 65.68 | 53.82 | 51.77 | 44.95 | -6.82 | 0.21 | 10.87 | 71.47 |
| 17 | MedGemma-4B | Medical | 48.13 | 50.29 | 46.14 | 43.33 | 41.29 | -2.04 | 0.05 | 20.61 | 74.03 |
| 18 | MedGemma-27B | Medical | 50.98 | 48.33 | 53.81 | 46.06 | 45.88 | -0.18 | 0.03 | 23.71 | 82.55 |
We would like to acknowledge that some parts of the code were inspired by and referenced from MME-CoT.
If you find M3CoTBench useful for your research, please consider giving a star⭐ and citation📝 :)
@misc{jiang2026m3cotbenchbenchmarkchainofthoughtmllms,
title={M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding},
author={Juntao Jiang and Jiangning Zhang and Yali Bi and Jinsheng Bai and Weixuan Liu and Weiwei Jin and Zhucun Xue and Yong Liu and Xiaobin Hu and Shuicheng Yan},
year={2026},
eprint={2601.08758},
archivePrefix={arXiv},
primaryClass={eess.IV},
url={https://arxiv.org/abs/2601.08758},
}

