[NeurIPS 2025] One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
- 📄 PDF: neurips_camera_ready_1_token_per_frame_compression.pdf
- 🎥 SlidesLive: Link to Presentation
Enhancing long video understanding via extreme compression by progressively reducing each selected frame to a single token.
XComp progressively compresses video tokens to one token per frame, enabling more comprehensive long video understanding.
XComp is fine-tuned from VideoChat-Flash-2B and uses the same environment and data. Please refer to VideoChat-Flash for installation and data preparation.
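A minimal environment sketch, assuming the same conda + pip workflow as VideoChat-Flash (the environment name, Python version, and requirements file location are assumptions; follow the VideoChat-Flash instructions for the authoritative steps):

```bash
# Hypothetical setup mirroring the VideoChat-Flash environment.
conda create -n xcomp python=3.10 -y
conda activate xcomp
# The requirements file location is an assumption; see VideoChat-Flash for the exact dependency list.
pip install -r requirements.txt
```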
- Training: ./llava-train_videochat
- Evaluation: ./lmms-eval_videochat (example launch commands for both are sketched below)
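A minimal sketch of how these two directories might be used: the training script path and the lmms-eval model/task names below are assumptions, not verbatim from this repository; consult each directory for the actual entry points.

```bash
# Training (hypothetical script path inside llava-train_videochat).
cd llava-train_videochat
bash scripts/train/stage3_short_long_mix_sft.sh   # assumed entry point

# Evaluation with lmms-eval (model name and task list are placeholders).
cd ../lmms-eval_videochat
python -m lmms_eval \
    --model videochat_flash \
    --tasks videomme,mlvu,longvideobench \
    --batch_size 1
```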
Download the model parameters from Google Drive and save them to XComp/llava-train_videochat/checkpoints/baseline_1000frame_cos/stagesuf-umt-hd-large-tome16_mlp_hd64_Qwen2_5_1_5B_stage3_short-long_mix_sft_mid2.yaml/
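For example, assuming the Google Drive download is an archive named `xcomp_checkpoint.zip` (a placeholder name), the checkpoint can be placed as follows:

```bash
# Create the expected checkpoint directory (path taken from the instruction above).
CKPT_DIR=XComp/llava-train_videochat/checkpoints/baseline_1000frame_cos/stagesuf-umt-hd-large-tome16_mlp_hd64_Qwen2_5_1_5B_stage3_short-long_mix_sft_mid2.yaml/
mkdir -p "$CKPT_DIR"
# "xcomp_checkpoint.zip" is a hypothetical filename for the downloaded archive.
unzip xcomp_checkpoint.zip -d "$CKPT_DIR"
```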
@inproceedings{
zhang2025one,
title={One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding},
author={Zheyu Aqa Zhang and Ziqi Pang and Shixing Chen and Xiang Hao and Vimal Bhat and Yu-Xiong Wang},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=bythzT0b81}
}
This work was supported in part by Amazon, NSF under Grants 2106825 and 2519216, and the DARPA Young Faculty Award. This work used computational resources, including Amazon Web Services (AWS), and the NCSA Delta and DeltaAI supercomputers through allocation CIS230012 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program.
We gratefully acknowledge the open-source projects that form the foundation of XComp: VideoChat-Flash, Qwen, and LLaVA-Video.
We also thank the authors of related open-source projects: UMT, lmms-eval, transformers, ToMe, PyramidDrop, LongVideoBench, MLVU, VideoMME, and LVBench.