🚀🚀🚀 Official implementation of VideoRoPE: What Makes for Good Video Rotary Position Embedding?
- Authors: Xilin Wei*, Xiaoran Liu*, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin
- Institutes: Fudan University; Shanghai AI Laboratory; Shanghai Innovation Institute
- Resources: [📖Paper] [🏠Project Page] [🤗Huggingface]
- 🔥 Four Key Properties for Video Position Embedding: We present an analysis of four key properties essential for RoPE when applied to video. Motivated by this analysis, we propose VideoRoPE, which incorporates Low-frequency Temporal Allocation (LTA), a Diagonal Layout (DL), and Adjustable Temporal Spacing (ATS) to satisfy all four properties (see the toy sketch after this list).
- 🔥 A Challenging Video Haystack Retrieval Benchmark: We introduce the challenging V-NIAH-D task to expose the drawbacks of current position embedding designs with respect to frequency allocation. Our findings reveal that existing Video LLMs are easily misled by frequency-based distractors.
- 🔥 Excellent Performance: Extensive experiments demonstrate that VideoRoPE consistently achieves superior performance compared to other RoPE variants. For example, VideoRoPE outperforms previous M-RoPE on long video retrieval (+12.4 on V-NIAH, +12.4 on V-NIAH-D), video understanding (+2.9 on LongVideoBench, +4.5 on MLVU, +1.7 on Video-MME) and hallucination (+11.9 on VideoHallucer) benchmarks.
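For intuition, below is a minimal toy sketch of how a diagonal, temporally stretched 3D position layout can be constructed. The index formula here is a simplified assumption for illustration only (LTA, which concerns how rotary frequencies are allocated across the three axes, is not shown); the actual VideoRoPE layout lives in the modified transformers/vLLM code in this repo (search for `#!`).

```python
import torch

def toy_video_position_ids(num_text: int, num_frames: int, grid_h: int, grid_w: int,
                           scale_factor: float = 2.0) -> torch.Tensor:
    """Return a (3, seq_len) tensor of (temporal, height, width) position ids."""
    pos = []
    # Text tokens: all three axes share the same scalar index, as in plain 1D RoPE.
    for i in range(num_text):
        pos.append((float(i),) * 3)
    t0 = float(num_text)
    for f in range(num_frames):
        # ATS: stretch the temporal gap between consecutive frames by `scale_factor`.
        t = t0 + scale_factor * f
        for hh in range(grid_h):
            for ww in range(grid_w):
                # DL: spatial indices are placed around the temporal diagonal
                # rather than restarting from zero in every frame.
                pos.append((t, t + hh - (grid_h - 1) / 2, t + ww - (grid_w - 1) / 2))
    return torch.tensor(pos).T

# 4 text tokens followed by 2 frames of 2x2 visual tokens -> shape (3, 12)
print(toy_video_position_ids(num_text=4, num_frames=2, grid_h=2, grid_w=2).shape)
```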
- [2025/7/2] VideoRoPE++ is released with the training-free extrapolation method YaRN-V and the comprehensive V-RULER benchmark. See the paper (currently on hold at arXiv, temporarily hosted here) and the code.
- [2025/6/7] VideoRoPE is selected as an ICML 2025 🌟Oral!
- [2025/3/7] The V-NIAH-D benchmark, checkpoints, and training data have been released on Huggingface.
- [2025/3/7] The training code has been added to the repository; please check it out.
- [2025/2/14] Code and Project Page are released!
- VideoRoPE Implementation with transformers
- VideoRoPE Implementation with vLLM
- V-NIAH-D Release
- Checkpoints Release
- Evaluation Code Release
- VideoRoPE++ Paper Release
- VideoRoPE++ Code Release
- VideoRoPE++ V-RULER Huggingface Release
Required Package Versions
- transformers: 4.45.2
- vllm: 0.6.3.post2.dev171+g890ca360
The VideoRoPE-specific changes (in both the transformers and vLLM code) are marked with `#!`, so you can find them quickly with Ctrl+F.
For inference with transformers:

```python
with torch.inference_mode():
    generated_ids = model.generate(
        ...,
        which_rope=which_rope,
        scale_factor=scale_factor,
    )
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    generated_text = output_text[0]
```
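The snippet above assumes that `model`, `processor`, `inputs`, `which_rope`, and `scale_factor` already exist. A minimal setup sketch is shown below; the checkpoint name, video path, and the `which_rope` value are placeholders (the accepted values are defined in the modified transformers code in this repo), so adjust them to the released VideoRoPE checkpoint and your own data.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper shipped with the Qwen2-VL toolkit

# Placeholders: swap in the released VideoRoPE checkpoint and your own video.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4"},
        {"type": "text", "text": "Describe this video."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

which_rope = "videorope"  # placeholder value; see the "#!" markers for the accepted options
scale_factor = 2.0        # temporal scaling factor used by ATS
```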
For vLLM inference:
```python
mm_data['which_rope'] = which_rope
mm_data['scale_factor'] = scale_factor
llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}
with torch.no_grad():
    outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
```
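Similarly, the vLLM snippet assumes `llm`, `sampling_params`, `prompt`, and `mm_data` are already built. A rough setup sketch under the same assumptions (placeholder checkpoint, video path, and `which_rope` value):

```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

# Placeholder checkpoint; swap in the released VideoRoPE checkpoint.
llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", limit_mm_per_prompt={"video": 1})
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4"},
        {"type": "text", "text": "Describe this video."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_, video_inputs = process_vision_info(messages)
mm_data = {"video": video_inputs}

which_rope = "videorope"  # placeholder value; see the "#!" markers for the accepted options
scale_factor = 2.0
```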
To verify the superiority of VideoRoPE, we use the diverse and high-quality video dataset LLaVA-Video-178K for video fine-tuning. To balance training efficiency and long-video comprehension, we randomly select 136K videos with durations under 2 minutes and 18K videos with durations between 2 and 3 minutes.
Once the data is prepared, you can fine-tune the model following the training data format of LLaMA-Factory:
```bash
cd LLaMA-Factory
sh multi_gpu_sft_slurm.sh
```

It is important to note that, in order to align with the training format of Qwen2-VL, we mainly made adjustments to `LLaMA-Factory/src/llamafactory/data/mm_plugin.py`.
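For reference, each training sample follows LLaMA-Factory's sharegpt-style video format (similar to its `mllm_video_demo.json` example). The sketch below writes one hypothetical sample; the path and dialogue are placeholders.

```python
import json

# Hypothetical single training sample in LLaMA-Factory's sharegpt-style video format.
# The <video> placeholder in the user turn is matched to the entries in "videos".
sample = {
    "messages": [
        {"role": "user", "content": "<video>What is happening in this clip?"},
        {"role": "assistant", "content": "A person is slicing vegetables in a kitchen."},
    ],
    "videos": ["data/videos/example.mp4"],
}

with open("videorope_sft_demo.json", "w") as f:
    json.dump([sample], f, indent=2, ensure_ascii=False)
```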
If you find our work helpful for your research, please consider giving a star ⭐ and a citation 📝.
```bibtex
@inproceedings{wei2025videorope,
  title={VideoRoPE: What Makes for Good Video Rotary Position Embedding?},
  author={Wei, Xilin and Liu, Xiaoran and Zang, Yuhang and Dong, Xiaoyi and Zhang, Pan and Cao, Yuhang and Tong, Jian and Duan, Haodong and Guo, Qipeng and Wang, Jiaqi and others},
  booktitle={International Conference on Machine Learning},
  year={2025}
}

@misc{wei2025videoropepp,
  title={VideoRoPE++: Towards Better Video Rotary Position Embedding},
  author={Wei, Xilin and Liu, Xiaoran and Zang, Yuhang and Ding, Shengyuan and Dong, Xiaoyi and Cao, Yuhang and Duan, Haodong and Guo, Qipeng and Wang, Jiaqi and Qiu, Xipeng and Lin, Dahua},
  year={2025},
  howpublished={\url{https://github.com/Wiselnn570/VideoRoPE/blob/main/videorope_plus/VideoRoPE_plus.pdf}},
  doi={10.5281/zenodo.16529245}
}
```

- transformers: the codebase we built upon. Thanks for their wonderful work.
- vLLM: an excellent open-source codebase for high-throughput and memory-efficient inference. Thanks for their wonderful work.
- Qwen2-VL: the amazing open-sourced multimodal large language model!
- LLaMA-Factory: Wonderful job in facilitating LLMs & VLMs training.

