Haoji Zhang*, Yiqin Wang*, Yansong Tang✉, Yong Liu, Jiashi Feng, Xiaojie Jin✉†
*Equally contributing first authors, ✉Correspondence, †Project Leader
Work done during an internship at ByteDance.
We propose Flash-VStream, an efficient VLM with a novel Flash Memory mechanism that enables real-time understanding and question answering over extremely long video streams. Our model achieves outstanding accuracy and efficiency on the EgoSchema, MLVU, LVBench, MVBench, and Video-MME benchmarks.
- [2025/6/26] 🔥 [ICCV 2025] Flash-VStream-Qwen is coming! We release the homepage, paper, code, and model.
- [2024/6/15] 🏅 Our team won 1st place in the Long-Term Video Question Answering Challenge of the LOVEU Workshop@CVPR'24. Here is our certificate. We used a Hierarchical Memory model based on Flash-VStream-7b.
- [2024/6/12] Flash-VStream-LLaVA is coming! We release the homepage, paper, code, and model for Flash-VStream, along with the dataset for the VStream-QA benchmark.
- See Flash-VStream-Qwen/README.md.
- See Flash-VStream-LLaVA/README.md.
If you find this project useful in your research, please consider citing:
@article{zhang2025flashvstream,
  title={Flash-VStream: Efficient Real-Time Understanding for Long Video Streams},
  author={Zhang, Haoji and Wang, Yiqin and Tang, Yansong and Liu, Yong and Feng, Jiashi and Jin, Xiaojie},
  journal={arXiv preprint arXiv:2506.23825},
  year={2025}
}
@article{zhang2024flashvstream,
  title={Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams},
  author={Zhang, Haoji and Wang, Yiqin and Tang, Yansong and Liu, Yong and Feng, Jiashi and Dai, Jifeng and Jin, Xiaojie},
  journal={arXiv preprint arXiv:2406.08085},
  year={2024}
}
We would like to thank the following repos for their great work:
- This work is built upon LLaVA.
- This work utilizes LLMs from Vicuna.
- Some code is borrowed from LLaMA-VID.
- Our video-based evaluation follows Video-ChatGPT.
This project is licensed under the Apache-2.0 License.

