
Video Transformer Network

Daniel Neimark, Omri Bar, Maya Zohar, Dotan Asselmann [Paper]


Installation

pip install timm
pip install transformers[torch]

Getting started

To use VTN models, refer to the configs under configs/Kinetics, or see MODEL_ZOO.md for pre-trained models*.

To train ViT-B-VTN on your dataset (see paper for details):

python tools/run_net.py \
  --cfg configs/Kinetics/VIT_B_VTN.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset

To test a trained ViT-B-VTN on the Kinetics-400 dataset:

python tools/run_net.py \
  --cfg configs/Kinetics/VIT_B_VTN.yaml \
  DATA.PATH_TO_DATA_DIR path_to_kinetics_dataset \
  TRAIN.ENABLE False \
  TEST.CHECKPOINT_FILE_PATH path_to_model \
  TEST.CHECKPOINT_TYPE pytorch

* VTN models in MODEL_ZOO.md produce slightly different results than those reported in the paper due to differences between the PySlowFast code base and the original code used to train the models (mainly around data and video loading).

Code and Hyperparameters

VTN main architecture

VTN Longformer Model

Hyperparameters
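The files linked above contain the full implementation. As a rough orientation only, the VTN idea (per-frame 2D features, a temporal attention module with a classification token, and an MLP head) can be sketched in plain PyTorch. Everything below is a simplified stand-in, not the repository's code: a toy convolutional backbone replaces the timm ViT-B backbone, and a dense nn.TransformerEncoder replaces the Longformer used by the real models.

```python
import torch
import torch.nn as nn

class VTNSketch(nn.Module):
    """Illustrative sketch of the VTN pipeline (not the official code):
    a 2D backbone embeds each frame independently, a temporal transformer
    (a stand-in here for the paper's Longformer) attends across the frame
    sequence, and a linear head reads out a learned [CLS] token."""

    def __init__(self, embed_dim=768, nhead=12, depth=3,
                 num_frames=16, num_classes=400):
        super().__init__()
        # Toy per-frame encoder; the actual models use a 2D backbone
        # such as ViT-B from timm.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=nhead,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        b, t = x.shape[:2]
        # Embed each frame independently with the 2D backbone.
        feats = self.backbone(x.flatten(0, 1)).view(b, t, -1)
        # Prepend the [CLS] token and add temporal position embeddings.
        tokens = torch.cat([self.cls_token.expand(b, -1, -1), feats], dim=1)
        tokens = tokens + self.pos_embed[:, : t + 1]
        out = self.temporal(tokens)   # attention across the frame sequence
        return self.head(out[:, 0])   # classify from the [CLS] token
```

The real models swap the dense temporal transformer for a Longformer, whose sliding-window attention scales linearly with sequence length and so lets VTN process much longer clips than full self-attention would allow.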

Citing VTN

If you find VTN useful for your research, please consider citing the paper using the following BibTeX entry.

@article{neimark2021video,
  title={Video Transformer Network},
  author={Neimark, Daniel and Bar, Omri and Zohar, Maya and Asselmann, Dotan},
  journal={arXiv preprint arXiv:2102.00719},
  year={2021}
}

Additional Qualitative Results

Label: Tai chi. Prediction: Tai chi.

Label: Chopping wood. Prediction: Chopping wood.

Label: Archery. Prediction: Archery.

Label: Throwing discus. Prediction: Flying kite.

Label: Surfing water. Prediction: Parasailing.