🔥🔥OVO-Bench is accepted by CVPR 2025!🔥🔥
Important Note: The current codebase has been modified since our initial arXiv paper. We strongly recommend basing any use of OVO-Bench on the current edition.
- Backward Tracing: trace back to past events to answer the question.
- Real-Time Visual Perception: understand and respond to events as they unfold at the current timestamp.
- Forward Active Responding: delay the response until sufficient future information becomes available to answer the question accurately.
OVO-Bench evaluates Video-LLMs' ability to find temporal visual clues from ongoing input, allowing models to wait for sufficient evidence before responding. We term this approach the Video Chain-of-Time thinking process, analogous to Chain-of-Thought reasoning in LLMs.
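The "wait for sufficient evidence" idea can be illustrated with a minimal sketch. This is not the benchmark's implementation: `respond_when_ready`, `try_answer`, and `toy_model` are hypothetical names introduced here, and the toy model simply pretends that the decisive event becomes visible 30 seconds into the stream.

```python
from typing import Callable, Iterable, Optional, Tuple


def respond_when_ready(
    timestamps: Iterable[float],
    try_answer: Callable[[float], Optional[str]],
) -> Optional[Tuple[float, str]]:
    """Poll a model at successive timestamps and return the first answer it commits to."""
    for t in timestamps:
        answer = try_answer(t)  # None means "not enough evidence yet"
        if answer is not None:
            return t, answer
    return None  # the stream ended before the model was ready to answer


def toy_model(t: float) -> Optional[str]:
    # Hypothetical stand-in: pretends the decisive event becomes visible at t = 30 s.
    return "the cup is placed on the table" if t >= 30 else None


if __name__ == "__main__":
    print(respond_when_ready([10, 20, 30, 40], toy_model))
```

In this sketch the model declines to answer at the first two timestamps and commits only once the evidence is available, which is the behaviour Forward Active Responding is meant to probe.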
- 644 videos
- 3,100 queries
- 263.42s average query timestamp
The following modules are required for the inference and scoring pipeline.
moviepy==1.0.3
numpy
pillow
tqdm
Or run pip install -r requirements to install all required modules.
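To confirm that the modules listed above are present before running the pipeline, a small check along these lines may help. It is only a sketch: `REQUIRED` and `check_requirements` are hypothetical names, and the only version pin assumed is the `moviepy==1.0.3` pin from the list above.

```python
from importlib import metadata

# Pins copied from the module list above; None means "any version".
REQUIRED = {"moviepy": "1.0.3", "numpy": None, "pillow": None, "tqdm": None}


def check_requirements(required):
    """Return a dict mapping each problematic package to a short description."""
    problems = {}
    for name, pinned in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if pinned is not None and installed != pinned:
            problems[name] = f"found {installed}, expected {pinned}"
    return problems


if __name__ == "__main__":
    for name, issue in check_requirements(REQUIRED).items():
        print(f"{name}: {issue}")
```

An empty result means every listed module was found and the moviepy pin matches.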
- Download the source videos and chunk them locally:
  - Download src_videos.tar.parta[a~e] (~44GB) from huggingface-repo
  - Place them under ./data, then concat and untar all files
  - Run bash scripts/chunk_video.sh to get all chunked video clips.
- (Recommended) Download our pre-chunked video clips:
  - Download chunked_videos.tar.parta[a~o] (~144GB) from huggingface-repo
  - Place them under ./data, then concat and untar all files.
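The "concat and untar" step above can be sketched in Python as follows. This is an illustrative helper, not a script shipped with the repository: `concat_and_untar` and the temporary file name `joined_archive.tar` are hypothetical, and the sketch assumes the parts are plain byte-wise splits of a single tar archive whose names sort in the correct order.

```python
import shutil
import tarfile
from pathlib import Path


def concat_and_untar(data_dir: str, prefix: str) -> Path:
    """Join split archive parts (prefix + 'aa', 'ab', ...) and extract them in data_dir."""
    data = Path(data_dir)
    # Lexicographic order is assumed to match the order in which the parts were split.
    parts = sorted(data.glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* under {data}")

    # Hypothetical name for the rejoined archive; it is removed after extraction.
    joined = data / "joined_archive.tar"
    with open(joined, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

    # "r:*" lets tarfile detect whether the archive is compressed.
    with tarfile.open(joined, mode="r:*") as archive:
        archive.extractall(path=data)
    joined.unlink()
    return data
```

For example, `concat_and_untar("./data", "src_videos.tar.part")` would process the source-video parts. Note that this approach temporarily needs disk space for the rejoined archive in addition to the parts and the extracted files.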
We divide our evaluation pipeline into two parts: inference and scoring. For our released models, run the provided scripts under the ./scripts directory. For example, for Gemini, run:
bash scripts/inference/Gemini.sh
All inference results will be saved under ./results/[MODEL_NAME]. Then run our scoring script:
bash scripts/score/Gemini.sh
Scores will be printed to the CLI:
Offline Model: Gemini
Evaluate Backward Tracing...
Task: ASI, Acc: 76.35
Task: HLD, Acc: 52.69
Task: EPM, Acc: 58.59
Backward Avg.: 62.54
Evaluate Real-time Visual Perception...
Task: ATR, Acc: 79.31
Task: ACR, Acc: 66.97
Task: OCR, Acc: 85.91
Task: STU, Acc: 58.43
Task: OJR, Acc: 61.96
Task: FPD, Acc: 63.37
Realtime Avg.: 69.32
Evaluate Forward Active Responding...
Task: REC, Acc: 35.53
Task: SSR, Acc: 74.24
Task: CRR, Acc: 61.67
Forward Avg.: 57.15
Total Avg.: 63.00
To evaluate your own models, inherit from the OVOBenchOffline/Online class in ./utils/OVOBench.py and implement your own inference pipeline. Refer to our provided models under ./models for further details.
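The averages in the Gemini score log above are consistent, to within rounding, with each category average being the mean of its task accuracies and the total being the mean of the three category averages. The sketch below reproduces that arithmetic from the printed numbers. It is an observation about the log, not a copy of the scoring script, and `SCORES` and `summarize` are hypothetical names.

```python
from statistics import mean

# Per-task accuracies copied from the Gemini score log above.
SCORES = {
    "Backward": {"ASI": 76.35, "HLD": 52.69, "EPM": 58.59},
    "Realtime": {"ATR": 79.31, "ACR": 66.97, "OCR": 85.91,
                 "STU": 58.43, "OJR": 61.96, "FPD": 63.37},
    "Forward": {"REC": 35.53, "SSR": 74.24, "CRR": 61.67},
}


def summarize(scores):
    """Return (per-category mean accuracy, mean of the category means)."""
    category_avg = {name: mean(list(tasks.values())) for name, tasks in scores.items()}
    total = mean(list(category_avg.values()))
    return category_avg, total


if __name__ == "__main__":
    category_avg, total = summarize(SCORES)
    for name, avg in category_avg.items():
        print(f"{name} Avg.: {avg:.2f}")
    print(f"Total Avg.: {total:.2f}")
```

Because the total is averaged over categories rather than over all twelve tasks, the three Forward Active Responding tasks carry the same combined weight as the six Real-Time Visual Perception tasks. The last printed digit may differ from the log, since the inputs here are already rounded to two decimals.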
OVO-Bench is released under the CC BY-NC-SA 4.0 license. By downloading our dataset from our website or other sources, the user agrees to adhere to the terms of CC BY-NC-SA 4.0 and the licenses of the source datasets.
@misc{li2025ovobenchfarvideollmsrealworld,
title={OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?},
author={Yifei Li and Junbo Niu and Ziyang Miao and Chunjiang Ge and Yuanhang Zhou and Qihao He and Xiaoyi Dong and Haodong Duan and Shuangrui Ding and Rui Qian and Pan Zhang and Yuhang Zang and Yuhang Cao and Conghui He and Jiaqi Wang},
year={2025},
eprint={2501.05510},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.05510},
}




