Official PyTorch implementation of Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding (AAAI 2022).
The paper is available at https://arxiv.org/pdf/2109.04872v2.pdf.
A paper explanation on Zhihu (in Chinese) is available at https://zhuanlan.zhihu.com/p/446203594.
Temporal grounding aims to localize a video moment that is semantically aligned with a given natural language query. Existing methods typically apply a detection or regression pipeline on the fused representation, with the research focus on designing complicated prediction heads or fusion strategies. Instead, viewing temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) to directly model the similarity between language queries and video moments in a joint embedding space. This new metric-learning framework enables fully exploiting negative samples from two new aspects: constructing negative cross-modal pairs in a mutual matching scheme and mining negative pairs across different videos. These new negative samples can enhance the joint representation learning of the two modalities via cross-modal mutual matching to maximize their mutual information. Experiments show that MMN achieves highly competitive performance compared with state-of-the-art methods on four video grounding benchmarks. Based on MMN, we also present the winning solution for the HC-STVG challenge of the 3rd PIC workshop. This suggests that metric learning is still a promising method for temporal grounding, by capturing the essential cross-modal correlation in a joint embedding space.
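To make the metric-learning formulation concrete, here is a minimal PyTorch sketch of a symmetric InfoNCE-style mutual matching loss. It is not the authors' exact implementation (the names `mutual_matching_loss`, `moment_emb`, `query_emb` and the temperature value are illustrative), but it shows the core idea: matched moment-query pairs act as positives, while all other moments and queries in the batch, including those from other videos, act as negatives.

```python
import torch
import torch.nn.functional as F

def mutual_matching_loss(moment_emb, query_emb, temperature=0.07):
    """moment_emb, query_emb: (B, D) embeddings of each ground-truth moment
    and its paired language query for a batch of B video-sentence pairs."""
    v = F.normalize(moment_emb, dim=-1)   # project onto the unit sphere
    q = F.normalize(query_emb, dim=-1)
    logits = v @ q.t() / temperature      # (B, B) cross-modal similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # mutual matching: both moment-to-query and query-to-moment directions;
    # off-diagonal entries (other queries/moments in the batch, possibly from
    # other videos) serve as the negative samples
    loss_v2q = F.cross_entropy(logits, targets)
    loss_q2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2q + loss_q2v)
```

For instance, `mutual_matching_loss(torch.randn(8, 256), torch.randn(8, 256))` returns a scalar loss over a batch of 8 pairs.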
May, 2022 - We released the code for spatio-temporal video grounding (HC-STVG dataset) here.
Dec, 2021 - We uploaded the code and trained weights for Charades-STA, ActivityNet-Captions and TACoS datasets.
- Download the video features provided by 2D-TAN. The ground-truth files are already uploaded in the `dataset` folder: we directly use the 2D-TAN ground-truth files for the ActivityNet and TACoS datasets, and we converted the original Charades annotations from 2D-TAN (a .txt file) into the same .json format as the other two datasets to simplify the dataset-loading code.
- Extract the features and put them under the corresponding dataset directory in the `dataset` folder. For the feature/ground-truth paths, please refer to `./mmn/config/paths_catalog.py` (`ann_file` is the annotation file, `feat_file` is the video feature file); see the sketch below.
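For reference, here is a hypothetical sketch of how a dataset entry in `paths_catalog.py` might be organized. The field names `ann_file` and `feat_file` come from the README above; the class layout, dataset keys, and file names are assumptions, so consult the actual file for the exact structure.

```python
# Hypothetical sketch of a paths_catalog.py entry; only ann_file/feat_file
# come from the README above -- everything else (class name, keys, file names)
# is illustrative and should be checked against ./mmn/config/paths_catalog.py.
class DatasetCatalog:
    DATA_DIR = "dataset"

    DATASETS = {
        "tacos_train": {
            "ann_file": "tacos/train.json",               # ground-truth annotations
            "feat_file": "tacos/tall_c3d_features.hdf5",  # 2D-TAN video features (assumed name)
        },
        # ... analogous entries for val/test splits and the other datasets
    }
```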
Our code is developed on top of the third-party implementation of 2D-TAN, so it has similar dependencies, such as:
```
yacs
h5py
terminaltables
tqdm
pytorch
transformers
```
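As a quick, optional sanity check, the following snippet simply verifies that the dependencies listed above are importable (using the standard module names for those packages):

```python
# Optional sanity check: verify the dependencies listed above are importable.
import h5py            # HDF5 video feature files
import terminaltables
import torch           # PyTorch
import tqdm
import transformers    # pretrained language encoders
import yacs.config     # configuration system

print("torch:", torch.__version__, "| transformers:", transformers.__version__)
```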
We provide scripts to simplify training and inference. For training, there is a script for each dataset (e.g., `./scripts/tacos_train.sh`); for evaluation, there is `./scripts/eval.sh`.
For example, to train a model on the TACoS dataset with `tacos_train.sh`, select the right config via `config` and set your GPUs via `gpus` (GPU ids on your server) and `gpun` (total number of GPUs):
```bash
# find all configs in configs/
config=pool_tacos_128x128_k5l8
# set your gpu ids
gpus=0,1
# number of gpus
gpun=2
# please modify these values (e.g., 127.0.0.2, 29502) when you run multiple MMN tasks on the same machine
master_addr=127.0.0.3
master_port=29511
```
Similarly, to evaluate a model, just change the corresponding fields in `eval.sh`. Our trained weights for the three datasets are available on Google Drive.
If you find our code useful, please consider citing our paper:
```bibtex
@inproceedings{DBLP:conf/aaai/00010WLW22,
  author    = {Zhenzhi Wang and
               Limin Wang and
               Tao Wu and
               Tianhao Li and
               Gangshan Wu},
  title     = {Negative Sample Matters: {A} Renaissance of Metric Learning for Temporal Grounding},
  booktitle = {{AAAI}},
  pages     = {2613--2623},
  publisher = {{AAAI} Press},
  year      = {2022}
}
```
For any questions, please raise an issue (preferred) or contact:
- Zhenzhi Wang: [email protected]
- Tao Wu: [email protected] (for HC-STVG only)
We thank 2D-TAN for the video features and configurations, and the third-party implementation of 2D-TAN for its DistributedDataParallel implementation. Disclaimer: the performance gain of that third-party implementation comes from a small mistake of adding the validation set into training; our reproduced results are similar to those reported in the 2D-TAN paper.