MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning
Suhao Yu*, Haojin Wang*, Juncheng Wu*, Cihang Xie, Yuyin Zhou
- [📄💥 May 22, 2025] Our arXiv paper is released.
- [💾 May 22, 2025] Full dataset released.
Star 🌟 us if you find it helpful!
MedFrameQA introduces multi-image, clinically grounded questions that require comprehensive reasoning across all images. Unlike prior benchmarks such as SLAKE and MedXpertQA, it emphasizes diagnostic complexity, expert-level knowledge, and explicit reasoning chains.
- We develop a scalable pipeline that automatically constructs multi-image, clinically grounded VQA questions from medical education videos.
- We benchmark ten state-of-the-art MLLMs on MedFrameQA and find that their accuracies mostly fall below 50%, with substantial performance fluctuation across different body systems, organs, and modalities.
We open-sourced our data and code here.
The MedFrameQA generation pipeline consists of four stages (a rough sketch of the frame-extraction step follows the list):
- Medical Video Collection: Collecting 3,420 medical videos via clinical search queries;
- Frame-Caption Pairing: Extracting keyframes and aligning with transcribed captions;
- Multi-Frame Merging: Merging clinically related frame-caption pairs into multi-frame clips;
- Question-Answer Generation: Generating multi-image VQA from the multi-frame clips.
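As a rough illustration of the frame-extraction step, the sketch below samples one frame per fixed time interval with OpenCV. The actual keyframe-selection and caption-alignment logic lives in `src/process.py` and is more involved; the function and parameter names here are purely illustrative.

```python
# A minimal sketch of fixed-interval frame sampling, assuming opencv-python is
# installed. This is NOT the pipeline's actual keyframe-selection logic.
import cv2

def extract_keyframes(video_path: str, interval_sec: float = 20.0):
    """Sample one frame every `interval_sec` seconds from a video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS is unavailable
    step = max(int(fps * interval_sec), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append((idx / fps, frame))  # (timestamp in seconds, image)
        idx += 1
    cap.release()
    return frames
```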
In figure (a), we show the distribution across body systems; (b) presents the distribution across organs; (c) shows the distribution across imaging modalities; (d) provides a word cloud of keywords in MedFrameQA; and (e) reports the distribution of frame counts per question.
| Dataset | 🤗 Huggingface Hub |
|---|---|
| MedFrameQA | SuhaoYu1020/MedFrameQA |
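If you use the `datasets` library, the dataset can be pulled directly from the Hub. This is a minimal sketch; check the dataset card for the actual split names and column layout.

```python
# A minimal sketch for loading MedFrameQA from the Hugging Face Hub,
# assuming the `datasets` library is installed.
from datasets import load_dataset

ds = load_dataset("SuhaoYu1020/MedFrameQA")
split = next(iter(ds))      # pick whichever split is available
print(ds)                   # splits, column names, and sizes
print(ds[split][0])         # inspect one multi-image question
```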
On a Linux system:
- Clone this repository and navigate to the folder
```bash
git clone https://github.com/haojinw0027/MedFrameQA.git
cd MedFrameQA
```
- Install Package
```bash
conda create -n medframeqa python=3.10 -y
conda activate medframeqa
pip install -r requirements.txt
```
```bash
cd src

# Download medical videos
python process.py --process_stage download_process --csv_file ../data/30_disease_video_id.csv
# Specify the number of videos to be downloaded
python process.py --process_stage download_process --csv_file ../data/30_disease_video_id.csv --num_ids number(-1 for all)

# Process the downloaded videos
python process.py --process_stage video_process --csv_file ../data/30_disease_video_id.csv

# Extract frame-caption pairs
python process.py --process_stage pair_process --csv_file ../data/30_disease_video_id.csv
# Specify the time intervals for the selection of video frames
python process.py --process_stage pair_process --csv_file ../data/30_disease_video_id.csv --bias_time 20

# Generate multi-image VQA pairs
python process.py --process_stage vqa_process --csv_file ../data/30_disease_video_id.csv
# Specify the max frame num of one question
python process.py --process_stage vqa_process --csv_file ../data/30_disease_video_id.csv --max_frame_num 5

# Evaluate a model on the VQA pairs
python eval_process.py --input_file "your vqa pairs file path" --output_dir ../eval --model_name "your model"
# Specify the number of questions you want to evaluate
python eval_process.py --input_file "your vqa pairs file path" --output_dir ../eval --model_name "your model" --num_q number(-1 for all)
```
You can download our dataset for evaluation at SuhaoYu1020/MedFrameQA.
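After evaluation, per-category accuracy can be summarized with a short script like the sketch below. The `prediction`, `answer`, and `modality` field names and the results path are assumptions for illustration; the exact output format of `eval_process.py` may differ.

```python
# A minimal sketch of scoring multiple-choice predictions grouped by a metadata
# key. Field names and the results file path are hypothetical.
import json
from collections import defaultdict

def accuracy_by_key(records, key="modality"):
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        group = r.get(key, "unknown")
        total[group] += 1
        if str(r["prediction"]).strip().upper() == str(r["answer"]).strip().upper():
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}

with open("../eval/results.json") as f:   # hypothetical output file
    records = json.load(f)
print(accuracy_by_key(records, key="modality"))
```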
If you find MedFrameQA useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{yu2025medframeqamultiimagemedicalvqa,
title={MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning},
author={Suhao Yu and Haojin Wang and Juncheng Wu and Cihang Xie and Yuyin Zhou},
year={2025},
eprint={2505.16964},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.16964},
}
```

- We thank the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.




