Junyu Xie1, Tengda Han1, Max Bain1, Arsha Nagrani1, Eshika Khandelwal2 3, Gül Varol1 3, Weidi Xie1 4, Andrew Zisserman1
1 Visual Geometry Group, Department of Engineering Science, University of Oxford
2 CVIT, IIIT Hyderabad
3 LIGM, École des Ponts, Univ Gustave Eiffel, CNRS
4 CMIC, Shanghai Jiao Tong University
In this work, we evaluate our model on common AD benchmarks including CMD-AD, MAD-Eval, and TV-AD.
- CMD-AD can be downloaded here.
- MAD-Eval can be downloaded here.
- TV-AD can be downloaded following instructions here.
- All annotations can be found in resources/annotations/.
- The AD predictions (by Qwen2-VL+LLaMA3 or GPT-4o+GPT-4o) can be downloaded here.
We propose a new evaluation metric, named "action score", that focuses on whether a specific ground truth (GT) action is captured within the prediction.
The detailed evaluation code can be found in action_score/.
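As an illustration only, the metric can be thought of as the fraction of GT actions judged to be captured by their corresponding predictions. The sketch below assumes a hypothetical is_action_captured judge (e.g., an LLM- or entailment-based check) and may differ from the released implementation in action_score/.

# Minimal sketch (not the released implementation): action score as the
# fraction of GT actions captured by the corresponding predicted ADs.
def action_score(gt_actions, predictions, is_action_captured):
    # gt_actions:  list of ground-truth action phrases, one per AD
    # predictions: list of predicted AD sentences, aligned with gt_actions
    # is_action_captured: callable (gt_action, prediction) -> bool, e.g. an LLM judge
    hits = sum(is_action_captured(a, p) for a, p in zip(gt_actions, predictions))
    return hits / max(len(gt_actions), 1)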
- Basic dependencies: python>=3.8, pytorch=2.1.2, transformers=4.46.0, Pillow, pandas, decord, opencv
- For inference with open-source models, set the cache path (for Qwen2-VL, LLaMA3, etc.) by modifying os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/" in stage1/main_qwen2vl.py and stage2/main_llama3.py
- For inference with the proprietary GPT-4o models, set the API key by modifying os.environ["OPENAI_API_KEY"] = <openai-api-key> in stage1/main_gpt4o.py and stage2/main_gpt4o.py
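For reference, the two settings above are plain environment-variable assignments; a minimal illustration (with placeholder values) is:

import os

# Cache directory for downloaded model weights (Qwen2-VL, LLaMA3, etc.),
# as modified in stage1/main_qwen2vl.py and stage2/main_llama3.py
os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/"

# OpenAI API key for GPT-4o inference,
# as modified in stage1/main_gpt4o.py and stage2/main_gpt4o.py
os.environ["OPENAI_API_KEY"] = "<openai-api-key>"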
To structure the context frames according to shots, as well as recognise characters in each shot, please refer to the guidelines in preprocess/.
(This step can be skipped by directly referring to the pre-computed results provided as resources/annotations/{dataset}_anno_context-3.0-8.0_face-0.2-0.4.csv)
To predict film grammar, including shot scales and thread structures, please follow the steps detailed in film_grammar/.
(This step can be skipped by directly referring to the pre-computed results provided as resources/annotations/{dataset}_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv)
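As an optional sanity check, the pre-computed annotation files are standard CSVs and can be inspected with pandas, e.g. for CMD-AD:

import pandas as pd

# Inspect the pre-computed CMD-AD annotations (with character recognition
# and film-grammar predictions); adjust the path for other datasets.
anno = pd.read_csv("resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv")
print(anno.shape)
print(anno.columns.tolist())
print(anno.head())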
python stage1/main_qwen2vl.py \ # or stage1/main_gpt4o.py to run with GPT-4o
--dataset={dataset} \ # e.g., "cmdad"
--anno_path={anno_path} \ # e.g., "resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv"
--charbank_path={charbank_path} \ # e.g., "resources/charbanks/cmdad_charbank.json"
--video_dir={video_dir} \
--save_dir={save_dir} \
--font_path="resources/fonts/times.ttf" \
--shot_label
--dataset: choices are cmdad, madeval, and tvad.
--anno_path: path to AD annotations (with character recognition results and film grammar predictions), available in resources/annotations.
--charbank_path: path to external character banks, available in resources/charbanks/.
--video_dir: directory of video datasets; example file structures can be found in resources/example_file_structures (files are empty, for reference only).
--save_dir: directory to save output csv.
--font_path: path to the font file used for shot labels (default is Times New Roman).
--shot_label: add a shot number label at the top-left corner of each frame.
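For example, putting the arguments above together for CMD-AD with Qwen2-VL (the video and output directories below are placeholders):

python stage1/main_qwen2vl.py \
--dataset=cmdad \
--anno_path=resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv \
--charbank_path=resources/charbanks/cmdad_charbank.json \
--video_dir=/path/to/cmdad_videos \
--save_dir=outputs/stage1_cmdad \
--font_path=resources/fonts/times.ttf \
--shot_label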
python stage2/main_llama3.py \ # or stage2/main_gpt4o.py to run with GPT-4o
--dataset={dataset} \ # e.g., "cmdad"
--mode={mode} \ # e.g., "single"
--pred_path={pred_path} \
--save_dir={save_dir}
--dataset: choices are cmdad, madeval, and tvad.
--mode: single for a single AD output; assistant for five candidate AD outputs.
--pred_path: path to the stage1 saved csv file.
--save_dir: directory to save output csv.
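For example, to run Stage 2 with LLaMA3 in single mode on the Stage 1 outputs for CMD-AD (the paths and output filename below are placeholders):

python stage2/main_llama3.py \
--dataset=cmdad \
--mode=single \
--pred_path=outputs/stage1_cmdad/{stage1_output}.csv \
--save_dir=outputs/stage2_cmdad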
If you find this repository helpful, please consider citing our work! 😊
@InProceedings{xie2025shotbyshot,
title = {Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation},
author = {Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and Eshika Khandelwal and G\"ul Varol and Weidi Xie and Andrew Zisserman},
booktitle = {ICCV},
year = {2025}
}
Qwen2-VL: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct
LLaMA3: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
GPT-4o: https://openai.com/api/