This repo contains the code for the paper "It's Just Another Day: Unique Video Captioning by Discriminative Prompting" (ACCV 2024, oral presentation).
This repo is forked from LaViLa. Please follow their installation instructions.
We recommend downloading all the models/checkpoints/pre-extracted features to a directory called storage for simplicity.
First, you'll need to download videos from Ego4D. Once this is done, to preprocess them (i.e. resize to a sensible resolution and chop into 300s chunks for quick reading), use the following command and set the options at the top of the script.
python preprocess_ego4d.py

This can be run on lots of CPU nodes in parallel with the following slurm script. Remember to change the options at the top of the script.
sbatch lavila/slurm/bc4/submit_convert_ego4d.sh

For all subsequent scripts, you'll need to set the --root option to the directory where the preprocessed videos (or features) are stored.
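For reference, the preprocessing amounts to resizing each video and splitting it into 300s chunks. Below is a minimal sketch of that step using ffmpeg directly; the paths, resolution, and encoding options are assumptions, and preprocess_ego4d.py may differ in its exact settings.

```python
# Sketch of the preprocessing step: resize each Ego4D video and split it into
# 300s chunks with ffmpeg. The source/destination layout and the 288px height
# are assumptions; the real script's options may differ.
import subprocess
from pathlib import Path

SRC = Path("ego4d/full_scale")      # downloaded Ego4D videos (assumed layout)
DST = Path("ego4d/preprocessed")    # becomes the --root for later scripts

for video in sorted(SRC.glob("*.mp4")):
    out_dir = DST / video.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", str(video),
        "-vf", "scale=-2:288",                     # resize to a sensible resolution (assumed)
        "-c:v", "libx264", "-an",
        "-f", "segment", "-segment_time", "300",   # chop into 300s chunks
        "-reset_timestamps", "1",
        str(out_dir / "%d.mp4"),
    ], check=True)
```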
Download the LaViLa video/text embedding network and VCLM checkpoints. The files you need are:
vclm_openai_timesformer_large_336px_gpt2_xl.pt_ego4d.jobid_246897.ep_0003.md5sum_443263.pth
clip_openai_timesformer_large.narrator_rephraser.ep_0003.md5sum_c89337.pth
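Optionally, you can sanity-check the downloads. The sketch below assumes the md5sum_XXXXXX token in each filename is a prefix of the file's md5 hash; that naming convention is an assumption, not something guaranteed by LaViLa.

```python
# Check each downloaded checkpoint against the md5 prefix embedded in its name
# (assumed convention). Files are hashed in chunks to avoid loading them whole.
import hashlib
import re
from pathlib import Path

def md5_hex(path: Path) -> str:
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for name in [
    "vclm_openai_timesformer_large_336px_gpt2_xl.pt_ego4d.jobid_246897.ep_0003.md5sum_443263.pth",
    "clip_openai_timesformer_large.narrator_rephraser.ep_0003.md5sum_c89337.pth",
]:
    token = re.search(r"md5sum_([0-9a-f]+)", name).group(1)
    ok = md5_hex(Path("storage") / name).startswith(token)
    print(f"{name}: {'OK' if ok else 'md5 prefix mismatch'}")
```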
The uniqueness is evaluated in the EgoVLP video/text embedding space. Download the EgoVLP checkpoint from here:
clip_openai_timesformer_large_336px_distilbert_base.baseline.ep_0003.pth
CDPNet predicts the similarity between video clips and prompted captions. Download the CDPNet checkpoints from here:
arch10_CSPairClsP_p10_e30_model.pkl
The clips used for the benchmark are contained in the files below. The first has the Ego4D video ID, timestamp, and original narration. The second groups them into sets. We use the first 300 sets for our benchmark, but we provide all 880 in case they are useful. Because of the way LaViLa loads files, you'll need all three files in the same directory. Note that no training happens - train_preds.pkl and val_preds.pkl are just duplicates of the validation data:
train_preds.pkl, val_preds.pkl, amin_clip_P1_OnlinePerVid_p10_e0.pkl
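If you want to peek at these files before running anything, here is a minimal loading sketch. The exact structure of the entries isn't documented here, so it just prints a sample:

```python
# Load the benchmark files from the storage directory and print a sample entry.
# The container types (dict vs list) are not assumed; we just inspect whatever is there.
import pickle
from pathlib import Path

STORAGE = Path("storage")
for name in ["val_preds.pkl", "amin_clip_P1_OnlinePerVid_p10_e0.pkl"]:
    with (STORAGE / name).open("rb") as f:
        data = pickle.load(f)
    print(name, type(data), len(data) if hasattr(data, "__len__") else "")
    if isinstance(data, dict):
        key, val = next(iter(data.items()))
        print("  sample:", key, val)
    elif isinstance(data, (list, tuple)) and data:
        print("  sample:", data[0])
```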
To run inference on the egocentric benchmark with LaViLa + CDPNet for T=+10s (i.e. the original 5s clip, advancing by up to 10s), run the following command. Remember to use the full paths for the saved models.
python main_run_unique_captioner.py --cap_resume storage/vclm_openai_timesformer_large_336px_gpt2_xl.pt_ego4d.jobid_246897.ep_0003.md5sum_443263.pth --pp_resume storage/arch10_CSPairClsP_p10_e30_model.pkl --emb_resume storage/clip_openai_timesformer_large.narrator_rephraser.ep_0003.md5sum_c89337.pth --eval_emb_resume storage/clip_openai_timesformer_large_336px_distilbert_base.baseline.ep_0003.pth --max_offset 10 --output_dir results --wbr cdp_lavila-vclm_egovlp-eval_10s

To run the LaViLa baseline for T=+10s, run the following. Notice how we specify --fixed_clip_len 15, as this includes the original 5s clip and the 10s advancement.
python main_run_unique_captioner.py --cap_resume storage/vclm_openai_timesformer_large_336px_gpt2_xl.pt_ego4d.jobid_246897.ep_0003.md5sum_443263.pth --pp_resume storage/arch10_CSPairClsP_p10_e30_model.pkl --emb_resume storage/clip_openai_timesformer_large.narrator_rephraser.ep_0003.md5sum_c89337.pth --eval_emb_resume storage/clip_openai_timesformer_large_336px_distilbert_base.baseline.ep_0003.pth --comb_maxp 1 --no_cap_default lav --fixed_clip_len 15 --output_dir results --wbr base_lavila-vclm_egovlp-eval_10s

This runs everything online with no pre-extracted features.
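For intuition, uniqueness in a shared video/text embedding space can be thought of as follows: a caption counts as unique within its set if it is closer to its own clip than to any other clip in the set. The sketch below illustrates that idea with pre-computed embeddings; it is not the repo's exact evaluation code, and the tensor names and shapes are assumptions.

```python
# Toy uniqueness check: given (N, D) embeddings for the N clips in one set and
# for their generated captions, count how many captions retrieve their own clip.
# How the embeddings are produced (here assumed to be EgoVLP-style encoders) is
# outside this sketch.
import torch
import torch.nn.functional as F

def uniqueness(clip_emb: torch.Tensor, cap_emb: torch.Tensor) -> float:
    clip_emb = F.normalize(clip_emb, dim=-1)
    cap_emb = F.normalize(cap_emb, dim=-1)
    sim = cap_emb @ clip_emb.T                           # (N captions) x (N clips) cosine similarities
    hits = sim.argmax(dim=1) == torch.arange(sim.size(0))
    return hits.float().mean().item()                    # fraction of captions closest to their own clip

# Example with random embeddings for a set of 6 clips:
print(uniqueness(torch.randn(6, 256), torch.randn(6, 256)))
```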
If you want to train CDPNet instead of using our pre-trained one, you'll first need the embeddings of the prompted captions from the training set, generated with the LaViLa XL VCLM and embedded with the Large embedding network above. This CDPNet will be specific to these two LaViLa models and the set of prompts, so it probably won't be that useful to you. However, if you do want them, you can either extract and embed all of these yourself, or contact me and I'll find a way to get them to you (they're quite large).
Next, run the following command:
python main_train_prompt_predictor.py --output-str arch10

This reads in prompted captions generated by the VCLM, and trains CDPNet to predict the cosine similarity between video clips and prompted captions.
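Conceptually, CDPNet maps a clip representation to a predicted clip/caption similarity for each prompt. The sketch below shows that kind of regressor on placeholder data; the architecture, feature dimension, and loss are assumptions for illustration and do not reproduce the repo's arch10 model.

```python
# Toy regressor: clip embedding -> predicted clip/prompted-caption cosine
# similarity for each prompt, trained with an MSE objective on placeholder data.
import torch
import torch.nn as nn

N_PROMPTS = 10   # assumed to match the "p10" in the checkpoint names
D_MODEL = 768    # assumed clip-feature dimension

model = nn.Sequential(nn.Linear(D_MODEL, 512), nn.ReLU(), nn.Linear(512, N_PROMPTS))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Placeholder batch: clip_feats are video embeddings, target_sims are measured
# cosine similarities between each clip and its N_PROMPTS prompted captions.
clip_feats = torch.randn(32, D_MODEL)
target_sims = torch.rand(32, N_PROMPTS)

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(clip_feats), target_sims)
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```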
We provide pre-extracted features for CDP on the timeloop movie benchmark here:
And the pre-extracted features for the baseline here:
tlm_10movies_baseline_feats.pth
And a pre-trained CDPNet model here:
tlm-arch10_CSPairClsP_p10_e30_model.pkl
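If you want to inspect the pre-extracted feature files (e.g. the one passed to --root below), a minimal loader sketch is:

```python
# Load a pre-extracted feature file and report what it contains. The internal
# structure isn't documented here, so this only prints types and tensor shapes.
import torch

feats = torch.load("storage/grouped_timeloop_feature_seqlen2_10movies.pth", map_location="cpu")
print(type(feats))
if isinstance(feats, dict):
    for key, val in list(feats.items())[:3]:
        print(key, getattr(val, "shape", type(val)))
```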
To run VideoLlama + CDP at T=+4s, run the following command:
python main_run_unique_captioner.py --dataset tlm --d_model 768 --root storage/grouped_timeloop_feature_seqlen2_10movies.pth --temporal_offset 2 --pp_resume storage/tlm-arch10_CSPairClsP_p10_e30_model.pkl --pp_threshold -2.0 --max_offset 4 --n_sets 10 --no_cap_default best_margin --enforce_amax 0 --priority none --wbr tlm-cdp-4

For the equivalent non-CDP baseline, run the following:
python main_run_unique_captioner.py --dataset tlm --d_model 768 --root storage/grouped_timeloop_feature_seqlen2_numseq6_gap0_start0.pth --temporal_offset 2 --pp_resume tlm-arch10_CSPairClsP_p10_e30_model.pkl --pp_threshold 2.0 --max_offset 4 --n_sets 12 --no_cap_default lav --comb_maxp 1 --enforce_amax 0 --priority none --wbr tlm-base-4

If you want to extract your own features, then you can download the frames here:
Our repetition timestamps are here: timeloop_timestamps.json
If you find this work interesting or useful, please cite:
@inproceedings{perrett2024unique,
title={It's Just Another Day: Unique Video Captioning by Discriminative Prompting},
author={Perrett, Toby and Han, Tengda and Damen, Dima and Zisserman, Andrew},
booktitle={ACCV},
year={2024}
}

Our codebase is forked from LaViLa and the benchmark uses Ego4D footage and narrations, so please cite them as well:
@inproceedings{zhao2023learning,
title={Learning Video Representations from Large Language Models},
author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
booktitle={CVPR},
year={2023}
}

@inproceedings{grauman2022ego4d,
title={Ego4D: Around the World in 3,000 Hours of Egocentric Video},
author={Grauman, Kristen and Westbury, Andrew and Byrne, Eugene and Chavis, Zachary and Furnari, Antonino and Girdhar, Rohit and Hamburger, Jackson and Jiang, Hao and Liu, Miao and Liu, Xingyu and others},
booktitle={CVPR},
year={2022}
}