🚀🚀🚀 Official implementation of VideoRoPE: What Makes for Good Video Rotary Position Embedding?
- Authors: Xilin Wei*, Xiaoran Liu*, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin
- Institutes: Fudan University; Shanghai AI Laboratory; Shanghai Innovation Institute
- Resources: [📖Paper] [🏠Project Page] [🤗Huggingface]
- 🔥 Four Key Properties for Video Position Embedding: We present an analysis of four key properties essential for RoPE when applied to video. Motivated by this analysis, we propose VideoRoPE, which incorporates Low-frequency Temporal Allocation (LTA), a Diagonal Layout (DL), and Adjustable Temporal Spacing (ATS) to satisfy all four properties (see the toy sketch after this list).
- 🔥 A Challenging Video Haystack Retrieval Benchmark: We introduce the challenging V-NIAH-D task to expose the drawbacks of current position embedding designs with respect to frequency allocation. Our findings reveal that existing Video LLMs are easily misled by frequency-based distractors.
- 🔥 Excellent Performance: Extensive experiments demonstrate that VideoRoPE consistently achieves superior performance compared to other RoPE variants. For example, VideoRoPE outperforms previous M-RoPE on long video retrieval (+12.4 on V-NIAH, +12.4 on V-NIAH-D), video understanding (+2.9 on LongVideoBench, +4.5 on MLVU, +1.7 on Video-MME) and hallucination (+11.9 on VideoHallucer) benchmarks.
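For intuition, below is a minimal toy sketch of how a diagonal, temporally stretched 3D position layout can be constructed. The index formula here is a simplified assumption for illustration only (LTA, which concerns how rotary frequencies are allocated across the three axes, is not shown); the actual VideoRoPE layout lives in the modified transformers/vLLM code in this repo (search for `#!`).

```python
import torch

def toy_video_position_ids(num_text: int, num_frames: int, grid_h: int, grid_w: int,
                           scale_factor: float = 2.0) -> torch.Tensor:
    """Return a (3, seq_len) tensor of (temporal, height, width) position ids."""
    pos = []
    # Text tokens: all three axes share the same scalar index, as in plain 1D RoPE.
    for i in range(num_text):
        pos.append((float(i),) * 3)
    t0 = float(num_text)
    for f in range(num_frames):
        # ATS: stretch the temporal gap between consecutive frames by `scale_factor`.
        t = t0 + scale_factor * f
        for hh in range(grid_h):
            for ww in range(grid_w):
                # DL: spatial indices are placed around the temporal diagonal
                # rather than restarting from zero in every frame.
                pos.append((t, t + hh - (grid_h - 1) / 2, t + ww - (grid_w - 1) / 2))
    return torch.tensor(pos).T

# 4 text tokens followed by 2 frames of 2x2 visual tokens -> shape (3, 12)
print(toy_video_position_ids(num_text=4, num_frames=2, grid_h=2, grid_w=2).shape)
```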
- [2025/7/2] VideoRoPE++ is released with the training-free extrapolation method YaRN-V and the comprehensive V-RULER benchmark. See the paper (currently on hold at arXiv, temporarily hosted here) and the code.
- [2025/6/7] VideoRoPE is selected as an ICML 2025 🌟Oral!
- [2025/3/7] The V-NIAH-D benchmark, checkpoints, and training data have been released on Huggingface.
- [2025/3/7] The training code has been added to the repository; please check it out.
- [2025/2/14] Code and Project Page are released!
- VideoRoPE Implementation with transformers
- VideoRoPE Implementation with vLLM
- V-NIAH-D Release
- Checkpoints Release
- Evaluation Code Release
- VideoRoPE++ Paper Release
- VideoRoPE++ Code Release
- VideoRoPE++ V-RULER Huggingface Release
Required Package Versions
- transformers: 4.45.2
- vllm: 0.6.3.post2.dev171+g890ca360
The VideoRoPE-specific changes (in both the transformers and vLLM code) are marked with `#!`, so you can find them quickly with Ctrl+F.
For inference with transformers:

```python
with torch.inference_mode():
    generated_ids = model.generate(
        ...,
        which_rope=which_rope,
        scale_factor=scale_factor,
    )
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    generated_text = output_text[0]
```
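The snippet above assumes that `model`, `processor`, `inputs`, `which_rope`, and `scale_factor` already exist. A minimal setup sketch is shown below; the checkpoint name, video path, and the `which_rope` value are placeholders (the accepted values are defined in the modified transformers code in this repo), so adjust them to the released VideoRoPE checkpoint and your own data.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper shipped with the Qwen2-VL toolkit

# Placeholders: swap in the released VideoRoPE checkpoint and your own video.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4"},
        {"type": "text", "text": "Describe this video."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

which_rope = "videorope"  # placeholder value; see the "#!" markers for the accepted options
scale_factor = 2.0        # temporal scaling factor used by ATS
```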
For vLLM inference:
```python
mm_data['which_rope'] = which_rope
mm_data['scale_factor'] = scale_factor
llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}
with torch.no_grad():
    outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
```
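Similarly, the vLLM snippet assumes `llm`, `sampling_params`, `prompt`, and `mm_data` are already built. A rough setup sketch under the same assumptions (placeholder checkpoint, video path, and `which_rope` value):

```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# Placeholder checkpoint; swap in the released VideoRoPE checkpoint.
llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", limit_mm_per_prompt={"video": 1})
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4"},
        {"type": "text", "text": "Describe this video."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_, video_inputs = process_vision_info(messages)
mm_data = {"video": video_inputs}

which_rope = "videorope"  # placeholder value; see the "#!" markers for the accepted options
scale_factor = 2.0
```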
To verify the superiority of VideoRoPE, we use the diverse and high-quality video dataset LLaVA-Video-178K for video fine-tuning. To balance training efficiency and long-video comprehension, we randomly select 136K videos with durations under 2 minutes and 18K videos with durations between 2 and 3 minutes.
Once the data is prepared, you can fine-tune the model following the training data format of LLaMA-Factory:
```bash
cd LLaMA-Factory
sh multi_gpu_sft_slurm.sh
```

It is important to note that, in order to align with the training format of Qwen2-VL, we mainly made adjustments to `LLaMA-Factory/src/llamafactory/data/mm_plugin.py`.
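For reference, each training sample follows LLaMA-Factory's sharegpt-style video format (similar to its `mllm_video_demo.json` example). The sketch below writes one hypothetical sample; the path and dialogue are placeholders.

```python
import json

# Hypothetical single training sample in LLaMA-Factory's sharegpt-style video format.
# The <video> placeholder in the user turn is matched to the entries in "videos".
sample = {
    "messages": [
        {"role": "user", "content": "<video>What is happening in this clip?"},
        {"role": "assistant", "content": "A person is slicing vegetables in a kitchen."},
    ],
    "videos": ["data/videos/example.mp4"],
}

with open("videorope_sft_demo.json", "w") as f:
    json.dump([sample], f, indent=2, ensure_ascii=False)
```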
If you find our work helpful for your research, please consider giving a star ⭐ and a citation 📝.
```bibtex
@inproceedings{wei2025videorope,
  title={VideoRoPE: What Makes for Good Video Rotary Position Embedding?},
  author={Wei, Xilin and Liu, Xiaoran and Zang, Yuhang and Dong, Xiaoyi and Zhang, Pan and Cao, Yuhang and Tong, Jian and Duan, Haodong and Guo, Qipeng and Wang, Jiaqi and others},
  booktitle={International Conference on Machine Learning},
  year={2025}
}

@misc{wei2025videoropepp,
  title={VideoRoPE++: Towards Better Video Rotary Position Embedding},
  author={Wei, Xilin and Liu, Xiaoran and Zang, Yuhang and Ding, Shengyuan and Dong, Xiaoyi and Cao, Yuhang and Duan, Haodong and Guo, Qipeng and Wang, Jiaqi and Qiu, Xipeng and Lin, Dahua},
  year={2025},
  howpublished={\url{https://github.com/Wiselnn570/VideoRoPE/blob/main/videorope_plus/VideoRoPE_plus.pdf}},
  doi={10.5281/zenodo.16529245}
}
```

- transformers: the codebase we built upon. Thanks for their wonderful work.
- vLLM: an excellent open-source codebase for high-throughput and memory-efficient inference. Thanks for their wonderful work.
- Qwen2-VL: the amazing open-sourced multimodal large language model!
- LLaMA-Factory: Wonderful job in facilitating LLMs & VLMs training.

