M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval

This repository contains the codebase and data annotations for our paper, M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval.

What is M2-RAAP?

M2-RAAP is a multi-modal recipe for effective and efficient zero-shot video-text retrieval. Specifically, M2-RAAP 1) filters and refines video-text pairs to improve data quality, 2) adopts key-frames as video inputs to reduce pre-training time, and 3) introduces temporal modeling and video feature enhancement to improve pre-training performance. Compared with the baselines, M2-RAAP uses only 10% of the data volume (10M -> 1M) and about 5% of the pre-training time (1920h -> 92h), while achieving new state-of-the-art (SOTA) results on four English downstream zero-shot video-text retrieval datasets and two Chinese ones.
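For readers new to the task, the sketch below illustrates how zero-shot video-text retrieval is commonly scored with text-to-video Recall@K over key-frame features. It is not the M2-RAAP implementation: the function names (`mean_pool`, `similarity_matrix`, `recall_at_k`), the toy feature vectors, and the use of simple mean pooling in place of the paper's temporal modeling and video feature enhancement are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frame_features):
    """Average per-key-frame features into one video vector.

    Assumption: a stand-in for temporal modeling, not the paper's method.
    """
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def similarity_matrix(text_feats, video_feats):
    """Cosine similarity between every text (rows) and every video (columns)."""
    texts = [l2_normalize(t) for t in text_feats]
    videos = [l2_normalize(v) for v in video_feats]
    return [[sum(a * b for a, b in zip(t, v)) for v in videos] for t in texts]


def recall_at_k(sim, k):
    """Text-to-video Recall@K, assuming text i is paired with video i."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy data (hypothetical): two videos with two key-frames each, two captions.
videos = [
    mean_pool([[1.0, 0.0], [0.8, 0.2]]),
    mean_pool([[0.0, 1.0], [0.2, 0.8]]),
]
texts = [[1.0, 0.1], [0.1, 1.0]]

sim = similarity_matrix(texts, videos)
print(recall_at_k(sim, k=1))
```

In a real evaluation, the feature vectors would come from pre-trained video and text encoders applied without fine-tuning on the downstream dataset, which is what makes the retrieval zero-shot.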

Codebase

We are preparing the codebase and data annotations, which will be released soon.

Citation

If you find M2-RAAP useful, please consider citing the following paper:

@misc{dong2024m2raap,
    title={M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval},
    author={Xingning Dong and Zipeng Feng and Chunluan Zhou and Xuzheng Yu and Ming Yang and Qingpei Guo},
    url={https://arxiv.org/abs/2401.17797},
    year={2024},
    eprint={2401.17797},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}