M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding


Juntao Jiang 1★ · Jiangning Zhang 1★ · Yali Bi 2 · Jinsheng Bai 1 · Weixuan Liu 3 · Weiwei Jin 4 · Zhucun Xue 1 · Yong Liu 1† · Xiaobin Hu 5 · Shuicheng Yan 5

1Zhejiang University     2University of Science and Technology of China     3East China Normal University
4Zhejiang Provincial People’s Hospital     5National University of Singapore

arXiv · PDF · Webpage

😊Continuous Updates

This repository is a comprehensive collection of resources for M3CoTBench. If you find any work missing or have any suggestions, feel free to open a pull request or contact us. We will promptly add the missing papers to this repository.

✨ Highlight!!!

Compared with existing multimodal medical benchmarks, our proposed M3CoTBench offers the following key advantages:

  1. Diverse Medical VQA Dataset.
    We curate a 1,079-image medical visual question answering (VQA) dataset spanning 24 imaging modalities, stratified by difficulty and annotated with step-by-step reasoning aligned with real clinical diagnostic workflows.
  2. Multidimensional CoT-Centric Evaluation Metrics.
    We propose a comprehensive evaluation protocol that measures reasoning correctness, efficiency, impact, and consistency, enabling fine-grained and interpretable analysis of CoT behaviors across diverse MLLMs.
  3. Comprehensive Model Analysis and Case Studies.
    We benchmark both general-purpose and medical-domain MLLMs using quantitative metrics and in-depth qualitative case studies, revealing strengths and failure modes in clinical reasoning to guide future model design.

🤓 You can view the scores and comparisons of each method at M3CoTBench LeaderBoard.

📬Summary of Contents

🔬 Data Pipeline

Data acquisition and annotation pipeline of M3CoTBench. a) Carefully curated medical images from various public sources. b) Multi-type and multi-difficulty QA generation via LLMs and expert calibration. c) Structured annotation of key reasoning steps aligned with clinical diagnostic workflows.

🌻 Benchmark Overview

Overview of M3CoTBench. Top: The benchmark covers 24 imaging modalities/examination types, 4 question types, and 13 clinical reasoning tasks. Middle: CoT annotation examples and 4 evaluation dimensions. Bottom: The distribution of image-QA pairs across a) modalities, b) question types, and c) tasks.

🔨Installation

1. Clone the repository

git clone https://github.com/juntaoJianggavin/M3CoTBench.git

2. Download the M3CoTBench Database

This section provides access to the M3CoTBench Database, which contains the complete .png image data of M3CoTBench and an .xlsx file with the question, answer, and annotated CoT steps for each item.

🥰You can download M3CoTBench Database to your local path using the following command:

huggingface-cli download --repo-type dataset --resume-download APRIL-AIGC/M3CoTBench --local-dir $YOUR_LOCAL_PATH

Then put M3CoTBench.xlsx and images/ into M3CoTBench/inference/datasets/

💪Usage

3. Inference

If you want to run your own model, you can follow the procedure below, though you are also free to run the experiments however you prefer. There are two inference modes: one uses a CoT prompt, and the other outputs the answer directly; either result can then go straight into the evaluation section.
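As a rough illustration of the two modes, hypothetical prompt wrappers might look like the following (the exact prompts used by the M3CoTBench scripts may differ):

```python
def build_prompt(question: str, mode: str = "direct") -> str:
    """Illustrative (not official) prompt templates for the two modes."""
    if mode == "direct":
        # Direct mode: ask for the final answer only.
        return f"{question}\nAnswer with the final answer only."
    if mode == "cot":
        # CoT mode: elicit step-by-step reasoning before the final answer.
        return (f"{question}\nThink step by step, explaining each reasoning "
                f"step, then state the final answer.")
    raise ValueError(f"unknown mode: {mode}")
```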

Enter the directory:

cd M3CoTBench/inference/

Specialized Medical Models

Note: Model weights should be placed in M3CoTBench/inference/pretrain. Each medical model requires its own specific Conda environment.

For HealthGPT:

conda env create -f ../environment/healthgpt_environment.yaml
conda activate M3CoTBench_healthgpt
bash models/medical_models/HealthGPT/llava/demo/run_batch_eval.sh

For HuatuoGPT-Vision:

conda env create -f ../environment/huatuo_environment.yaml
conda activate M3CoTBench_huatuo
# Direct Inference
python models/medical_models/HuatuoGPT-Vision/eval.py --run_direct
# CoT Inference
python models/medical_models/HuatuoGPT-Vision/eval.py --run_cot

For LLaVA-Med:

conda env create -f ../environment/llavamed_environment.yaml
conda activate M3CoTBench_llavamed
# Direct Inference
python models/medical_models/LLaVA-Med/llava/eval/model_vqa.py --mode direct
# CoT Inference
python models/medical_models/LLaVA-Med/llava/eval/model_vqa.py --mode cot

Note: Lingshu and MedGemma are integrated into the General Framework below.

General Framework

Environment: M3CoTBench

conda env create -f ../environment/environment.yaml
conda activate M3CoTBench_env

(1) API Inference

# Start "GPT-5" on port xxxxx with 4 internal processes
bash scripts/run_api_model.sh "GPT-5" xxxxx 4

# Start "Claude-Sonnet-4.5" on port xxxxx (default 4 processes)
bash scripts/run_api_model.sh "Claude-Sonnet-4.5" xxxxx

(2) Local Inference

bash scripts/run_local_gpu_model.sh LLaVA-CoT 1,2,3,4,5,6 all xxxxx

To rerun failed inference data and update results:

cd M3CoTBench/inference/
# 1. Rerun failed files and merge into the original JSON
python reprocess_failed.py \
    --input-file final_output/Lingshu-32B/Lingshu-32B_direct.json \
    --model "Lingshu-32B" \
    --data-path "dataset/M3CoTBench.xlsx" \
    --image-dir "dataset/images" \
    --update-in-place

# 2. Recalculate timing summary
python recalculate_summary.py \
    --results-file final_output/Lingshu-32B/Lingshu-32B_direct.json \
    --summary-file final_output/Lingshu-32B/Lingshu-32B_summary.json

4. Evaluation

Step 1: Merge Chain-of-Thought Fields.

Merge the CoT steps of the correct answers and convert the format to XLSX.

cd M3CoTBench/evaluation/
python combine_fields.py

Step 2: Reformat results.

Batch-format the inference JSON files into the evaluation output format (XLSX). Each resulting file will contain both the CoT of the correct answer and the predicted answer from the inference.

python tools/update_lmmseval_json.py

Step 3. Run Evaluation Scripts.

You can run the metric scripts individually. For example, to evaluate recall and precision:

bash scripts/recall.sh
bash scripts/precision.sh

Note: Simply update the data path for YOUR_MODEL_NAME inside recall.sh (or other script files).

After the GPT evaluation, you should see a cache/ directory structured as follows:

📂 cache
 ┣━━ 📂 recall
 ┃    ┗━━ 📂 YOUR_MODEL_NAME
 ┃         ┣━━ 📄 1.json
 ┃         ┣━━ 📄 2.json
 ┃         ┗━━ 📄 ...
 ┗━━  📂 precision
    ┗━━ 📂 YOUR_MODEL_NAME

Step 4. Calculate Metrics for P, R and F1.

We cache the evaluation results for all questions in the cache directory. Here, we read results from the cache to calculate the final metrics.

For example, to run correctness.py:

python final_score/correctness.py --cache_dir cache --save_path final_results

The script will automatically calculate Recall and Precision, and then compute the F1 Score or Average Score.
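Assuming correctness.py uses the standard F1 (harmonic mean of Precision and Recall), a minimal sketch on the 0–100 scale used in the results table:

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, GPT-5's table entries P = 64.15 and R = 48.34 give F1 ≈ 55.13, matching its reported F1.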

Alternatively, you can calculate each metric individually. For example, to calculate Recall:

python final_score/recall.py --cache_dir cache/recall --save_path final_results

Then you can see a directory structured as follows:

📂final_results/
├─ 📂recall/
│  ├─ 📄recall_results.json
│  └─ 📄recall_errors.json
├─ 📂precision/
│  ├─ 📄precision_results.json
│  └─ 📄precision_errors.json
└─ 📂quality/
   └─ 📄quality_results.json

The P, R, F1 scores are stored in "quality_results.json".

Step 5. Calculate Accuracies for the answers.

Evaluate the direct answer:

python scripts/accuracy.py \
  --json_path "../inference/final_output/Qwen3-VL-30B-Thinking/Qwen3-VL-30B-Thinking_direct.json" \
  --excel_path "../inference/dataset/M3CoTBench.xlsx" \
  --output_path "Qwen3-VL-30B-Thinking_direct.json" \
  --model "gpt-4o" \
  --api_key  "sk-your-api-key-here" \
  --base_url "your-api-url-here"

Evaluate the CoT answer:

python scripts/accuracy.py \
  --json_path "../inference/final_output/Qwen3-VL-30B-Thinking/Qwen3-VL-30B-Thinking_cot.json" \
  --excel_path "../inference/dataset/M3CoTBench.xlsx" \
  --output_path "Qwen3-VL-30B-Thinking_cot.json" \
  --model "gpt-4o" \
  --api_key "sk-your-api-key-here" \
  --base_url "your-api-url-here"

Then the impact score can be calculated.
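The exact Impact formula is not spelled out here, but the numbers in the results table are consistent with I = Acc_step − Acc_dir (e.g. LLaVA-CoT: 36.75 − 40.08 = −3.33); a sketch under that assumption:

```python
def impact_score(acc_direct: float, acc_cot: float) -> float:
    """Impact of CoT prompting on answer accuracy.

    Inferred from the results table, where the I column equals
    Acc_step minus Acc_dir; this is an inference from the reported
    numbers, not a definitive specification.
    """
    return round(acc_cot - acc_direct, 2)
```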

Step 6. Calculate the Efficiency Metrics for the Output Steps.

The durations for direct and CoT inferences are in the summary output file (e.g. Claude-Sonnet-4.5_summary.json).

{
    "cot": {
        "total_item_count": 1079,
        "successful_item_count": 1079,
        "failed_item_count": 0,
        "total_successful_time_s": 13598.991699999999,
        "total_wasted_time_s": 75.924144786
    },
    "direct": {
        "total_item_count": 1079,
        "successful_item_count": 1079,
        "failed_item_count": 0,
        "total_successful_time_s": 5051.896599999994,
        "total_wasted_time_s": 63.773742768000005
    }
}

Then the Latency score can be obtained as total_successful_time_s_cot / total_successful_time_s_direct = 13598.9917 / 5051.8966 ≈ 2.69.

The total matched steps can be found in the recall output file "recall_results.json" under final_results/recall/.

"Claude-Sonnet-4.5": {
        "overall_metrics": {
            "average_recall": 0.5971,
            "total_matched_steps": 2100
        },

...

Then the Efficiency Score can be obtained as total_matched_steps / total_successful_time_s_cot = 2100 / 13598.9917 ≈ 0.1544.

Step 7. Calculate the Consistency Score for the Output Steps.

We use GPT-4o and Gemini-2.5-Pro for evaluation and average the results.

python scripts/batch_processor.py \
  --result_dir "../inference/final_output/" \
  --output_dir "./final_consistency_results_gpt4o" \
  --question_file "../inference/dataset/M3CoTBench.xlsx" \
  --api-base-url "your-api-url-here" \
  --api_key "sk-your-api-key-here" \
  --model_name gpt-4o

python scripts/lcs_analyzer.py \
  --type_file "../inference/dataset/type.xlsx" \
  --csv_dir "./final_consistency_results_gpt4o" \
  --output_dir "./consistency_score_gpt4o"
python scripts/batch_processor.py \
  --result_dir "../inference/final_output/" \
  --output_dir "./final_consistency_results_gemini" \
  --question_file "../inference/dataset/M3CoTBench.xlsx" \
  --api-base-url "your-api-url-here" \
  --api_key "sk-your-api-key-here" \
  --model_name gemini-2.5-pro
python scripts/lcs_analyzer.py \
  --type_file "../inference/dataset/type.xlsx" \
  --csv_dir "./final_consistency_results_gemini" \
  --output_dir "./consistency_score_gemini"

Then the consistency scores will be in "./consistency_score_gpt4o/lcs_results/lcs_analyzer_zh_summary.csv" and "./consistency_score_gemini/lcs_results/lcs_analyzer_zh_summary.csv". The column named "overall_average_similarity" contains the consistency score; simply average the two results.

📊Experiments

Performance scores of different methods.

Metrics: ↑ Higher is Better, ↓ Lower is Better. Bold: Best result.

Correctness (↑) is reported as F1 / P / R; Impact (↑) as Acc_dir / Acc_step / I; Efficiency as E (↑) and L (↓); Consistency as C_path (↑).

| # | Model | Category | F1 | P | R | Acc_dir | Acc_step | I | E (↑) | L (↓) | C_path (↑) |
|---|-------|----------|----|---|---|---------|----------|---|-------|-------|------------|
| 1 | LLaVA-CoT | Open-source | 49.80 | 54.08 | 46.15 | 40.08 | 36.75 | -3.33 | 0.06 | 1.56 | 77.02 |
| 2 | InternVL3.5-8B | Open-source | 56.48 | 60.61 | 52.88 | 56.81 | 53.61 | -3.20 | 0.10 | 18.27 | 71.65 |
| 3 | InternVL3.5-30B | Open-source | 59.42 | 62.15 | 56.92 | **63.81** | 57.60 | -6.21 | 0.03 | 16.68 | 76.30 |
| 4 | Qwen3-VL-Instruct-8B | Open-source | 55.17 | 52.74 | 57.84 | 51.30 | 46.62 | -4.68 | 0.04 | 93.94 | 82.65 |
| 5 | Qwen3-VL-Instruct-30B | Open-source | 59.15 | 56.13 | 62.51 | 54.63 | 51.39 | -3.24 | 0.03 | 35.63 | 83.01 |
| 6 | Qwen3-VL-Thinking-8B | Open-source | 59.87 | 59.84 | 59.91 | 48.33 | 52.83 | **+4.50** | 0.02 | 2.79 | 76.91 |
| 7 | Qwen3-VL-Thinking-30B | Open-source | 62.15 | 63.34 | 61.01 | 51.90 | 55.47 | +3.57 | 0.02 | 1.15 | 76.02 |
| 8 | GPT-4.1 | Closed-source | 60.76 | 58.32 | 63.42 | 56.77 | 57.97 | +1.22 | 0.17 | 5.08 | 81.31 |
| 9 | GPT-5 | Closed-source | 55.13 | 64.15 | 48.34 | 58.76 | 58.29 | -0.47 | 0.06 | **1.10** | 65.39 |
| 10 | Gemini 2.5 Pro | Closed-source | **66.07** | 62.48 | **70.10** | 60.24 | **60.06** | -0.18 | 0.10 | 1.52 | 82.00 |
| 11 | Claude-Sonnet-4.5 | Closed-source | 56.50 | 53.62 | 59.71 | 51.25 | 51.07 | -0.18 | 0.15 | 2.69 | **85.22** |
| 12 | LLaVA-Med (7B) | Medical | 30.51 | 36.33 | 26.30 | 29.38 | 29.29 | -0.09 | **0.35** | 3.22 | 72.68 |
| 13 | HuatuoGPT-Vision (7B) | Medical | 49.45 | 51.17 | 47.85 | 41.89 | 34.94 | -6.95 | 0.21 | 5.92 | 73.19 |
| 14 | HealthGPT (3.8B) | Medical | 32.56 | 47.27 | 24.83 | 44.11 | 41.98 | -2.13 | 0.06 | 15.36 | 67.72 |
| 15 | Lingshu-7B | Medical | 57.57 | 63.96 | 52.34 | 50.00 | 42.08 | -7.92 | 0.30 | 8.37 | 74.83 |
| 16 | Lingshu-32B | Medical | 59.16 | **65.68** | 53.82 | 51.77 | 44.95 | -6.82 | 0.21 | 10.87 | 71.47 |
| 17 | MedGemma-4B | Medical | 48.13 | 50.29 | 46.14 | 43.33 | 41.29 | -2.04 | 0.05 | 20.61 | 74.03 |
| 18 | MedGemma-27B | Medical | 50.98 | 48.33 | 53.81 | 46.06 | 45.88 | -0.18 | 0.03 | 23.71 | 82.55 |

🙏 Acknowledgments

We would like to acknowledge that some parts of the code were inspired by and referenced from MME-CoT.

✒️Citation

If you find M3CoTBench useful for your research, please consider giving a star⭐ and citation📝 :)

@misc{jiang2026m3cotbenchbenchmarkchainofthoughtmllms,
      title={M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding}, 
      author={Juntao Jiang and Jiangning Zhang and Yali Bi and Jinsheng Bai and Weixuan Liu and Weiwei Jin and Zhucun Xue and Yong Liu and Xiaobin Hu and Shuicheng Yan},
      year={2026},
      eprint={2601.08758},
      archivePrefix={arXiv},
      primaryClass={eess.IV},
      url={https://arxiv.org/abs/2601.08758}, 
}

✉️Contact

About

Official implementation of the paper "M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding"
