Junyu Xie1, Tengda Han1, Max Bain1, Arsha Nagrani1, Eshika Khandelwal2 3, Gül Varol1 3, Weidi Xie1 4, Andrew Zisserman1
1 Visual Geometry Group, Department of Engineering Science, University of Oxford
2 CVIT, IIIT Hyderabad
3 LIGM, École des Ponts, Univ Gustave Eiffel, CNRS
4 CMIC, Shanghai Jiao Tong University
In this work, we evaluate our model on common AD benchmarks including CMD-AD, MAD-Eval, and TV-AD.
- CMD-AD can be downloaded here.
- MAD-Eval can be downloaded here.
- TV-AD can be downloaded following instructions here.
- All annotations can be found in resources/annotations/.
- The AD predictions (by Qwen2-VL+LLaMA3 or GPT-4o+GPT-4o) can be downloaded here.
We propose a new evaluation metric, named "action score", that focuses on whether a specific ground truth (GT) action is captured within the prediction.
The detailed evaluation code can be found in action_score/.
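As an illustration only, the metric can be thought of as the fraction of GT actions judged to be captured by their corresponding predictions. The sketch below assumes a hypothetical is_action_captured judge (e.g., an LLM- or entailment-based check) and may differ from the released implementation in action_score/.

# Minimal sketch (not the released implementation): action score as the
# fraction of GT actions captured by the corresponding predicted ADs.
def action_score(gt_actions, predictions, is_action_captured):
    # gt_actions:  list of ground-truth action phrases, one per AD
    # predictions: list of predicted AD sentences, aligned with gt_actions
    # is_action_captured: callable (gt_action, prediction) -> bool, e.g. an LLM judge
    hits = sum(is_action_captured(a, p) for a, p in zip(gt_actions, predictions))
    return hits / max(len(gt_actions), 1)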
- Basic dependencies: python>=3.8, pytorch=2.1.2, transformers=4.46.0, Pillow, pandas, decord, opencv
- For inference with open-source models, set the cache path (for Qwen2-VL, LLaMA3, etc.) by modifying os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/" in stage1/main_qwen2vl.py and stage2/main_llama3.py
- For inference with the proprietary GPT-4o models, set the API key by modifying os.environ["OPENAI_API_KEY"] = <openai-api-key> in stage1/main_gpt4o.py and stage2/main_gpt4o.py
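For reference, the two settings above are plain environment-variable assignments; a minimal illustration (with placeholder values) is:

import os

# Cache directory for downloaded model weights (Qwen2-VL, LLaMA3, etc.),
# as modified in stage1/main_qwen2vl.py and stage2/main_llama3.py
os.environ['TRANSFORMERS_CACHE'] = "/path/to/cache/"

# OpenAI API key for GPT-4o inference,
# as modified in stage1/main_gpt4o.py and stage2/main_gpt4o.py
os.environ["OPENAI_API_KEY"] = "<openai-api-key>"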
To structure the context frames according to shots, as well as recognise characters in each shot, please refer to the guidelines in preprocess/.
(This step can be skipped by directly referring to the pre-computed results provided as resources/annotations/{dataset}_anno_context-3.0-8.0_face-0.2-0.4.csv)
To predict film grammar, including shot scales and thread structures, please follow the steps detailed in film_grammar/.
(This step can be skipped by directly referring to the pre-computed results provided as resources/annotations/{dataset}_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv)
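As an optional sanity check, the pre-computed annotation files are standard CSVs and can be inspected with pandas, e.g. for CMD-AD:

import pandas as pd

# Inspect the pre-computed CMD-AD annotations (with character recognition
# and film-grammar predictions); adjust the path for other datasets.
anno = pd.read_csv("resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv")
print(anno.shape)
print(anno.columns.tolist())
print(anno.head())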
python stage1/main_qwen2vl.py \ # or stage1/main_gpt4o.py to run with GPT-4o
--dataset={dataset} \ # e.g., "cmdad"
--anno_path={anno_path} \ # e.g., "resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv"
--charbank_path={charbank_path} \ # e.g., "resources/charbanks/cmdad_charbank.json"
--video_dir={video_dir} \
--save_dir={save_dir} \
--font_path="resources/fonts/times.ttf" \
--shot_label
--dataset: choices are cmdad, madeval, and tvad.
--anno_path: path to AD annotations (with character recognition results and film grammar predictions), available in resources/annotations.
--charbank_path: path to external character banks, available in resources/charbanks/.
--video_dir: directory of video datasets; example file structures can be found in resources/example_file_structures (files are empty, for reference only).
--save_dir: directory to save output csv.
--font_path: path to the font file used for shot labels (default is Times New Roman).
--shot_label: add a shot number label at the top-left corner of each frame.
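For example, putting the arguments above together for CMD-AD with Qwen2-VL (the video and output directories below are placeholders):

python stage1/main_qwen2vl.py \
--dataset=cmdad \
--anno_path=resources/annotations/cmdad_anno_context-3.0-8.0_face-0.2-0.4_scale_thread.csv \
--charbank_path=resources/charbanks/cmdad_charbank.json \
--video_dir=/path/to/cmdad_videos \
--save_dir=outputs/stage1_cmdad \
--font_path=resources/fonts/times.ttf \
--shot_label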
python stage2/main_llama3.py \ # or stage2/main_gpt4o.py to run with GPT-4o
--dataset={dataset} \ # e.g., "cmdad"
--mode={mode} \ # e.g., "single"
--pred_path={pred_path} \
--save_dir={save_dir}
--dataset: choices are cmdad, madeval, and tvad.
--mode: single for a single AD output; assistant for five candidate AD outputs.
--pred_path: path to the stage1 saved csv file.
--save_dir: directory to save output csv.
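For example, to run Stage 2 with LLaMA3 in single mode on the Stage 1 outputs for CMD-AD (the paths and output filename below are placeholders):

python stage2/main_llama3.py \
--dataset=cmdad \
--mode=single \
--pred_path=outputs/stage1_cmdad/{stage1_output}.csv \
--save_dir=outputs/stage2_cmdad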
If you find this repository helpful, please consider citing our work! 😊
@InProceedings{xie2025shotbyshot,
title = {Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation},
author = {Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and Eshika Khandelwal and G\"ul Varol and Weidi Xie and Andrew Zisserman},
booktitle = {ICCV},
year = {2025}
}
Qwen2-VL: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct
LLaMA3: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
GPT-4o: https://openai.com/api/