MoramiSu/QFT-MICCAI2024

QFT

The official implementation of Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training. We use Visual Question Answering (VQA) for multimodal pre-training to guide the framework to focus on targeted pathological features. We leverage descriptions in medical reports to design multi-granular question-answer pairs associated with different diseases, which assist the framework in pre-training without requiring extra annotations from experts. We also propose a novel pre-training framework with a quasi-textual feature transformer (QFT), a module designed to transform visual features into a quasi-textual space closer to the textual domain via a contrastive learning strategy. This narrows the vision-language gap and facilitates modality alignment.
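To make the contrastive idea concrete, here is a minimal sketch of a symmetric InfoNCE-style loss between transformed visual features and text features. It is an illustration only and is not taken from this repository: the function names (`l2_normalize`, `info_nce`), the temperature value, and the use of plain Python lists instead of tensors are all assumptions, and the loss actually used in the code may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(quasi_text_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired feature vectors.

    Row i of `quasi_text_feats` (hypothetical output of the feature
    transformer) is assumed to be the positive match of row i of `text_feats`.
    """
    q = [l2_normalize(v) for v in quasi_text_feats]
    t = [l2_normalize(v) for v in text_feats]
    n = len(q)

    # Cosine similarity matrix scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(q[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # The correct "class" for row i is column i (the matching pair).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    # Average the image-to-text and text-to-image directions.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


# Toy batch: aligned pairs should give a lower loss than swapped pairs.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = info_nce(visual, aligned_text)
swapped_loss = info_nce(visual, swapped_text)
```

Minimizing a loss of this kind pulls each transformed visual feature toward its paired text feature and pushes it away from the other texts in the batch, which is one common way to narrow a vision-language gap.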

image

Main Results

Results in Visual Recognition

image

image

Results in Report Generation

image

image

Poster

Poster.pdf

Implementation

Setting

  • Set the hyperparameters and paths in ./constants.py

Run training process

  • Run ./models/QFT/QFT_training.py to train the model:

    python QFT_training.py --gpus 1 --strategy ddp --precision 16 --img_encoder vit_base
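For readers unfamiliar with the flags in the command above, the following sketch shows how such options could be parsed with `argparse`. It is hypothetical and not copied from `QFT_training.py`: the defaults, types, and help strings are assumptions, and the real script may define these options differently (for example through its training framework).

```python
import argparse


def build_parser():
    """Build a parser for the four flags shown in the training command.

    Hypothetical sketch: defaults and help texts are assumptions.
    """
    parser = argparse.ArgumentParser(description="Sketch of QFT training flags")
    parser.add_argument("--gpus", type=int, default=1,
                        help="number of GPUs to train on")
    parser.add_argument("--strategy", type=str, default="ddp",
                        help="distributed training strategy")
    parser.add_argument("--precision", type=int, default=16,
                        help="numeric precision in bits (16 = mixed precision)")
    parser.add_argument("--img_encoder", type=str, default="vit_base",
                        help="name of the image encoder backbone")
    return parser


# Parse the exact flag string from the command above.
command = "--gpus 1 --strategy ddp --precision 16 --img_encoder vit_base"
args = build_parser().parse_args(command.split())
```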

Citation

@article{su2024design,
title={Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training},
author={Su, Tongkun and Li, Jun and Zhang, Xi and Jin, Haibo and Chen, Hao and Wang, Qiong and Lv, Faqin and Zhao, Baoliang and Hu, Yin},
journal={arXiv preprint arXiv:2404.00226},
year={2024}
}

Acknowledgement

This work was supported by the Guangzhou Science and Technology Program (No. 2023B01J0022), the Key Fundamental Research Program of Shenzhen (No. JCYJ20220818101408019), NSFC General Project (No. 62072452), and the Regional Joint Fund of Guangdong (No. 2021B1515130003, 2021B1515120011).
