Yanrui Yu, Tianfei Zhou, Jiaxin Sun, Lianpeng Qiao, Lizhong Ding, Ye Yuan, and Guoren Wang
- [2025/07/25] Our paper was accepted at ACM Multimedia 2025 (MM '25), and we have released the code on GitHub!
In modern urban environments, camera networks generate massive amounts of operational footage -- reaching petabytes each day -- making scalable video analytics essential for efficient processing. Many existing approaches adopt an SQL-based paradigm for querying such large-scale video databases; however, this constrains queries to rigid patterns with predefined semantic categories, significantly limiting analytical flexibility.
In this work, we explore a language-driven video analytics paradigm aimed at enabling flexible and efficient querying of high-volume video data driven by natural language.
Particularly, we build LAVA, a system that accepts natural language queries and retrieves traffic targets across multiple levels of granularity and arbitrary categories. LAVA comprises three main components:
- A multi-armed bandit-based efficient sampling method for video segment-level localization
- A video-specific open-world detection module for object-level retrieval
- A long-term object trajectory extraction scheme for temporal object association, yielding complete trajectories for objects of interest
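To give a flavor of the first component, here is a minimal UCB1-style bandit sketch for segment-level localization. This is an illustrative assumption, not LAVA's actual implementation: each video segment is treated as an arm, and the (hypothetical) reward is whether a sampled frame satisfies the query predicate, so the sampling budget concentrates on promising segments.

```python
import math
import random

def ucb_sample(segment_hit_rates, budget, c=1.4, seed=0):
    """UCB1 bandit over video segments: spend a frame-sampling budget
    where predicate matches are most likely.

    segment_hit_rates: hypothetical per-segment probability that a
    sampled frame satisfies the query predicate (a stand-in for a
    real detector call in this sketch).
    """
    rng = random.Random(seed)
    n = len(segment_hit_rates)
    pulls = [0] * n
    rewards = [0.0] * n

    # Sample each segment once to initialize its estimate.
    for i in range(n):
        pulls[i] = 1
        rewards[i] = float(rng.random() < segment_hit_rates[i])

    for t in range(n, budget):
        # UCB1 score: empirical hit rate + exploration bonus.
        scores = [
            rewards[i] / pulls[i] + c * math.sqrt(math.log(t + 1) / pulls[i])
            for i in range(n)
        ]
        i = max(range(n), key=scores.__getitem__)
        pulls[i] += 1
        rewards[i] += float(rng.random() < segment_hit_rates[i])

    return pulls  # frames sampled per segment

# The segment with the highest hit rate receives most of the budget.
allocation = ucb_sample([0.05, 0.6, 0.1], budget=300)
```

In practice the reward signal would come from running a cheap proxy model on the sampled frame rather than from known hit rates.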
To support comprehensive evaluation, we further develop a novel benchmark by providing diverse, semantically rich natural language predicates and fine-grained annotations for multiple videos.
Experiments on this benchmark demonstrate that LAVA improves F1-scores for selection queries by 14%, reduces MAPE for aggregation queries by 0.39, and achieves top-k precision of 86%, while processing videos 9.6× faster than the most accurate baseline.
To run this code, you will need Python 3.8 and the following dependencies:
- Install dassl:

```shell
# Clone this repo
git clone https://github.com/KaiyangZhou/Dassl.pytorch.git
cd Dassl.pytorch/

# Install dependencies
pip install -r requirements.txt

# Install this library (no need to re-build if the source code is modified)
python setup.py develop
cd ..
```

- Install other dependencies:

```shell
pip install -r requirements.txt
```

To run the pipeline, pass a dataset name and a natural language predicate:

```shell
bash scripts/pipline.sh $DATASET $PREDICATE
```

The LAVA Dataset is now officially released on HuggingFace! You can access the complete dataset (including video files, annotation files, and usage guides) via the following link:
https://huggingface.co/datasets/xiaoyu123hhh/Lava_Dataset
The dataset covers 6 locations (amsterdam, caldot1, caldot2, jackson, shibuya, warsaw), with separate train and test splits. Each split includes MP4 videos and corresponding label.json annotation files, and supports frame extraction and bounding box visualization through the provided scripts. For detailed usage (e.g., frame extraction, annotation parsing, and visualized examples), please refer to the dataset documentation on the HuggingFace page.
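As a starting point for working with the annotations, here is a small sketch that groups boxes by frame. The schema shown (`frame`, `track_id`, `category`, `bbox` keys) is a hypothetical example for illustration only; consult the dataset documentation on the HuggingFace page for the authoritative label.json format.

```python
import json
from collections import defaultdict

# Hypothetical annotation records mimicking a label.json payload;
# the real schema may differ -- see the HuggingFace dataset docs.
sample = json.dumps([
    {"frame": 0, "track_id": 1, "category": "car", "bbox": [10, 20, 50, 60]},
    {"frame": 0, "track_id": 2, "category": "bus", "bbox": [5, 5, 40, 40]},
    {"frame": 1, "track_id": 1, "category": "car", "bbox": [12, 22, 52, 62]},
])

def boxes_per_frame(label_json):
    """Group annotations by frame index, e.g. for drawing boxes on
    extracted frames or filtering by category."""
    per_frame = defaultdict(list)
    for ann in json.loads(label_json):
        per_frame[ann["frame"]].append(
            (ann["track_id"], ann["category"], ann["bbox"])
        )
    return dict(per_frame)

frames = boxes_per_frame(sample)
# frames[0] holds two boxes; frames[1] holds the car's updated box.
```

Pairing this grouping with a frame extractor (e.g. OpenCV's `VideoCapture`) gives per-frame box overlays for visual inspection.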