👋👋👋 A collection of resources related to large language models (LLMs) in video anomaly detection 🚨.
📌 For more details, please refer to our paper.
🛠️ Please let us know if you find a mistake or have any suggestions by e-mail: [email protected]
If you find our work useful for your research, please cite the following paper:
```bibtex
@article{ding2024quo,
  title={Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight},
  author={Ding, Xi and Wang, Lei},
  journal={arXiv preprint arXiv:2412.18298},
  year={2024}
}
```

- [27/12/2024] 🎁 The GitHub repository for our paper has been released.
- [25/12/2024] 🎄 Our paper has been published on arXiv.
Figure: the four key perspectives, (a) temporal modeling, (b) interpretability, (c) training-free, and (d) open-world.
We present a systematic evaluation of 13 closely related works from 2024 that use large language models (LLMs) and vision-language models (VLMs) for video anomaly detection (VAD). The analysis is organized around four key perspectives: (a) temporal modeling, (b) interpretability, (c) training-free, and (d) open-world detection, each represented by a subfigure. For each perspective, we highlight the strategies used, key strengths, limitations, and outline promising directions for future research. The video frames used in the analysis are sourced from the MSAD dataset.
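The training-free perspective in (c) can be sketched as a captioning-plus-scoring loop: a VLM describes each frame in text, and an LLM judges the description, so no task-specific training is needed. The snippet below is only a toy illustration of that idea; `describe_frame` and `score_caption` are hypothetical stand-ins for a real VLM captioner and LLM scorer, not the API of any surveyed method.

```python
# Toy sketch of a training-free VAD loop: caption frames, score captions,
# then smooth scores over a temporal window. The two stubs below stand in
# for a real VLM (captioning) and LLM (scoring); they are NOT real APIs.

def describe_frame(frame):
    """Stand-in for a VLM captioner; returns a text description of the frame."""
    return frame["caption"]  # in practice: generated by a captioning model

def score_caption(caption, anomaly_keywords=("fight", "explosion", "robbery")):
    """Stand-in for an LLM scorer: 1.0 if the description suggests an anomaly."""
    return 1.0 if any(k in caption.lower() for k in anomaly_keywords) else 0.0

def detect(frames, window=3):
    """Score each frame, then average scores over a temporal window."""
    raw = [score_caption(describe_frame(f)) for f in frames]
    smoothed = []
    for i in range(len(raw)):
        lo, hi = max(0, i - window // 2), min(len(raw), i + window // 2 + 1)
        smoothed.append(sum(raw[lo:hi]) / (hi - lo))
    return smoothed

video = [{"caption": "people walking"}, {"caption": "people walking"},
         {"caption": "a fight breaks out"}, {"caption": "people scattering"},
         {"caption": "street is calm"}]
print(detect(video))  # scores peak around the "fight" frame
```

The temporal smoothing step mirrors the motivation behind perspective (a): a single-frame score is noisy, so aggregating over neighboring frames stabilizes the anomaly curve.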
We compare recent approaches in VAD, highlighting key aspects such as interpretability, temporal modeling, few-shot learning, and open-world detection. Performance is evaluated across six benchmark datasets: UCSD Ped2 (Ped2), CUHK Avenue (CUHK), ShanghaiTech (ShT), UCF-Crime (UCF), XD-Violence (XD), and UBnormal (UB). Datasets evaluated using Area Under the Curve (AUC) include Ped2, CUHK, ShT, UCF, and UB, while the XD dataset is evaluated using Average Precision (AP).
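As a reminder of how the two metrics differ, here is a minimal, dependency-free sketch of frame-level AUC (via the Mann-Whitney U statistic) and AP; the labels and scores are toy values, not taken from any benchmark.

```python
# Dependency-free illustration of the two evaluation metrics:
# AUC = probability a random anomalous frame outscores a random normal one;
# AP  = precision averaged at each correctly retrieved anomalous frame.

def auc(labels, scores):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall point
    return total / sum(labels)

y = [0, 0, 1, 1, 0, 1]               # 1 = anomalous frame (toy data)
s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # predicted anomaly scores
print(round(auc(y, s), 3), round(average_precision(y, s), 3))  # → 0.889 0.917
```

Both metrics are threshold-free, which is why they dominate VAD evaluation: AUC weighs normal and anomalous frames symmetrically, while AP (used for XD-Violence) focuses on ranking the anomalous frames highly.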
| Method | Code | LLM/VLM | Interpret. | Temporal | Few-shot | Open-world | Ped2 | CUHK | ShT | UCF | XD | UB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VLAVAD | - | Fine-tuning | ✅ | ✅ | -- | -- | 99.0 | 87.6 | 87.2 | -- | -- | -- |
| VADor | - | Fine-tuning | ✅ | ✅ | -- | -- | -- | -- | -- | 88.1 | -- | -- |
| OVVAD | - | Fine-tuning | -- | ✅ | -- | ✅ | -- | -- | -- | 86.4 | 66.5 | 62.9 |
| LAVAD | GitHub | Training-free | ✅ | ✅ | -- | ✅ | -- | -- | -- | 80.3 | 62.0 | -- |
| TPWNG | - | Fine-tuning | -- | ✅ | -- | -- | -- | -- | -- | 87.8 | 83.7 | -- |
| Holmes-VAD | GitHub | Fine-tuning | ✅ | ✅ | -- | -- | -- | -- | -- | 89.5 | 90.7 | -- |
| AnomalyRuler | GitHub | Fine-tuning | -- | -- | ✅ | -- | 97.9 | 89.7 | 85.2 | -- | -- | 71.9 |
| STPrompt | - | Fine-tuning | ✅ | ✅ | -- | -- | -- | -- | 97.8 | 88.1 | -- | 64.0 |
| Holmes-VAU | GitHub | Fine-tuning | ✅ | ✅ | -- | -- | -- | -- | -- | 89.0 | 87.7 | -- |
| VERA | - | Training-free | ✅ | -- | -- | -- | -- | -- | -- | 86.6 | 88.2 | -- |
The figure presents the most popular sampling strategies for video tasks. The table below compares these sampling strategies for temporal reasoning.
| Sampling | Interval | Frame Count | Redundancy | Target Use Case | Efficiency |
|---|---|---|---|---|---|
| Uniform | Fixed | Medium | Medium | Global trend | High |
| Random | Random | Medium | Low | Data augmentation | High |
| Key frame | Adaptive | Low to Med. | Low | Key event extraction | Medium |
| Dense | Every frame | High | High | Fine-grained modeling | Low |
| Sliding window | Adaptive | Medium | Medium | Local temporal details | Medium |
| Adaptive | Dynamic | High | Low | Comprehensive modeling | Medium |
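The index-based strategies in the table can be sketched in a few lines; the snippet below is a toy illustration of uniform, random, and sliding-window sampling over an N-frame video. Key-frame and adaptive sampling depend on frame content (e.g. motion or scene changes), so they are omitted here.

```python
# Toy implementations of three index-based sampling strategies from the
# table above, operating on the frame indices of an n_frames-long video.
import random

def uniform_sample(n_frames, k):
    """Fixed interval: k indices spread evenly over the video (global trend)."""
    step = n_frames / k
    return [int(i * step) for i in range(k)]

def random_sample(n_frames, k, seed=0):
    """Random interval: k distinct indices, useful for data augmentation."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_frames), k))

def sliding_windows(n_frames, size, stride):
    """Overlapping windows that preserve local temporal details."""
    return [list(range(start, start + size))
            for start in range(0, n_frames - size + 1, stride)]

print(uniform_sample(100, 5))     # → [0, 20, 40, 60, 80]
print(sliding_windows(10, 4, 3))  # → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The efficiency column in the table follows directly from these shapes: uniform and random sampling touch only k frames, while dense sampling (interval of one frame) processes every frame and is the most expensive.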
We warmly invite everyone to contribute to this repository and help enhance its quality and scope. Feel free to submit pull requests to add new papers, projects, or other useful resources, as well as to correct any errors you discover. To ensure consistency, please format your pull requests to match the structure of the existing tables. We greatly appreciate your valuable contributions and support!




