The PyTorch implementation of "Video-Text Pre-training with Learned Regions" (arXiv).
We are still cleaning up the code and preparing the pre-trained weights.
Overall, this code is built on PyTorch with DistributedDataParallel (DDP).
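For reference, a minimal sketch of how a model is typically wrapped for DDP training (illustrative only; the actual launch and wrapping logic lives in this repo's training scripts):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model, local_rank):
    # Initialize the default process group; with init_method="env://" the
    # launcher provides MASTER_ADDR/MASTER_PORT, RANK and WORLD_SIZE.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # Wrap the model so gradients are all-reduced across GPUs at each step.
    return DDP(model, device_ids=[local_rank])
```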
- Create a conda env and install the required packages via `sh setup_myEnv.sh`.
- Create some important folders: `mkdir data` (you can symlink huge datasets to this folder), `mkdir meta_data` (put the metadata of each dataset here), and `mkdir results`.
- Download the pre-training data.
  PS: Not all videos are available, so you may need to modify the metadata depending on your case (see the sketch after these steps). We also provide our metadata here.
- Run `sh pre-training.sh` (commands with different settings are listed in this script).
- Download data (see https://github.com/m-bain/frozen-in-time#-finetuning-benchmarks-msr-vtt)
- Run `sh fine-tune.sh`.
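As noted above, the provided metadata may need to be pruned to the videos you actually have. A minimal sketch of one way to do this, assuming a CSV with a `video_path` column (the file name and column name here are hypothetical; adapt them to the actual metadata format):

```python
import os
import pandas as pd

# Hypothetical metadata file with a 'video_path' column; adjust both the
# path and the column name to the metadata schema you are using.
meta = pd.read_csv("meta_data/webvid_train.csv")

# Keep only rows whose video file exists under the data/ folder.
available = meta[meta["video_path"].apply(
    lambda p: os.path.exists(os.path.join("data", p)))]

available.to_csv("meta_data/webvid_train_filtered.csv", index=False)
print(f"Kept {len(available)} of {len(meta)} entries")
```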
This code is based on Frozen in Time.
@article{yan2021video,
title={Video-Text Pre-training with Learned Regions},
author={Yan, Rui and Shou, Mike Zheng and Ge, Yixiao and Wang, Alex Jinpeng and Lin, Xudong and Cai, Guanyu and Tang, Jinhui},
journal={arXiv preprint arXiv:2112.01194},
year={2021}
}