Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering 💻
Chenyang Lyu, Tianbo Ji, Yvette Graham, Jennifer Foster
School of Computing, Dublin City University, Dublin, Ireland 🏠
This repository contains the code for the Semantic-aware Dynamic Retrospective-Prospective Reasoning system for Event-level Video Question Answering (EVQA) 💻. The system utilizes explicit semantic connections between questions and visual information at the event level to improve the reasoning process and provide optimal answers 🔍.
Event-Level Video Question Answering (EVQA) requires complex reasoning across video events to obtain the visual information needed for optimal answers. However, few studies have focused on utilizing explicit semantic connections between questions and visual information, especially at the event level. In this paper, we propose a semantic-aware dynamic retrospective-prospective reasoning approach for video-based question answering. We explicitly incorporate the Semantic Role Labeling (SRL) structure of the question in the dynamic reasoning process, determining which frame to move to based on the focused part of the SRL structure (agent, verb, patient, etc.) 🧐. We evaluate our approach on the TrafficQA benchmark EVQA dataset and demonstrate superior performance compared to previous state-of-the-art models 💪.
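To give a concrete flavour of the core idea, below is a minimal, illustrative sketch (not the exact architecture in this repository) of how SRL-guided retrospective-prospective frame selection can be expressed as cross-attention between SRL argument embeddings and frame features:

```python
import torch
import torch.nn as nn

class SRLGuidedFrameSelector(nn.Module):
    """Illustrative only: matches each SRL argument of the question (agent, verb,
    patient, ...) against frame features and picks the frame to focus on next."""

    def __init__(self, hidden_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, srl_arg_embeds, frame_feats):
        # srl_arg_embeds: (batch, num_args, hidden)   -- one vector per SRL argument
        # frame_feats:    (batch, num_frames, hidden) -- one vector per sampled frame
        attended, attn_weights = self.cross_attn(srl_arg_embeds, frame_feats, frame_feats)
        # The most-attended frame per argument is the one the reasoning process
        # "moves to" next: a smaller index than the current frame is a retrospective
        # step, a larger index a prospective one.
        next_frame_idx = attn_weights.argmax(dim=-1)  # (batch, num_args)
        return attended, next_frame_idx
```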
Please download the TrafficQA dataset (videos and the corresponding annotations) from this link: https://sutdcv.github.io/SUTD-TrafficQA/#/download and then move the files under the data/ directory.
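After downloading, the layout should look roughly like the following (the folder names below are illustrative; keep whatever names the official download uses):

```
data/
├── annotations/   # TrafficQA question/answer annotation files
└── videos/        # TrafficQA video clips
```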
Please use data_preprocess.py to extract frames from the videos in the TrafficQA dataset and to tokenize the annotation data into a tensor dataset.
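For reference, a minimal sketch of frame extraction with MoviePy (the actual data_preprocess.py may sample frames differently):

```python
from pathlib import Path
from moviepy.editor import VideoFileClip  # in MoviePy >= 2.0: from moviepy import VideoFileClip

def extract_frames(video_path: str, out_dir: str, n_frames: int = 10) -> None:
    """Sample n_frames evenly spaced frames from a video and save them as JPEGs."""
    clip = VideoFileClip(video_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(n_frames):
        # Evenly spaced timestamps over the clip's duration.
        t = clip.duration * i / n_frames
        clip.save_frame(str(out / f"frame_{i:02d}.jpg"), t=t)
    clip.close()
```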
Please use the following script to train and evaluate the VideoQA system:
python run_traffic_qa.py --do_train --do_eval --num_train_epochs 2 --n_frames 10 --eval_n_frames 10 --learning_rate 5e-6 --train_batch_size 8 --eval_batch_size 16 --attention_heads 8 --eval_steps 5000

Once the model is trained, you can use it for VideoQA tasks. Provide a video, and the system will give the most probable answer based on the video. 🔎
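If you already have a trained model and only want to re-run evaluation, the same script can be invoked without the training flag (this reuses only the options shown above; checkpoint-loading options, if any, depend on your setup):

```
python run_traffic_qa.py --do_eval --eval_n_frames 10 --eval_batch_size 16 --attention_heads 8
```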
- Python (>=3.8) 🐍
- PyTorch (>=2.0) 🔥
- MoviePy 🧮
- ffmpeg 🐼
Please make sure to install the required dependencies before running the code. ⚙️
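For example, under a standard pip-based setup (ffmpeg is a system-level tool and is usually installed via your OS package manager or conda rather than pip):

```
pip install "torch>=2.0" moviepy
```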
Please cite our paper using the BibTeX entry below if you find it useful:
@article{lyu2023semantic,
  title={Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering},
  author={Lyu, Chenyang and Ji, Tianbo and Graham, Yvette and Foster, Jennifer},
  journal={arXiv preprint arXiv:2305.08059},
  year={2023}
}