Hidir Yesiltepe1
·
Tuna Han Salih Meral1
·
Adil Kaan Akan2
·
Kaan Oktay2
·
Pinar Yanardag1
1Virginia Tech 2fal
Infinity-RoPE-teaser.1.2.mp4
- [2026 Feb 08] Infinity-RoPE has been accepted at CVPR 2026 🎉!
- [2026 Feb 08] Thanks to @zhuhz22's recommendation, we adapted Causal Forcing checkpoints to Infinity-RoPE!
- [2026 Jan 16] We released the code.
- [2026 Jan 11] LongLive adopted Infinity-RoPE to turn their long video generator into an infinite video generator!
- [2025 Nov 25] We released the paper and the project page.
| Self-Forcing + ∞-RoPE | Causal-Forcing + ∞-RoPE |
|---|---|
| harry_potter_sf.mp4 | harry_potter_cf.mp4 |
We tested this repo on the following setup:
- Nvidia GPU with at least 24 GB memory (RTX 4090, A100, and H100 are tested).
- Linux operating system.
- 64 GB RAM.
Other hardware setups may also work but have not been tested.
Create a Python 3.10 environment, install dependencies, and download models:
```bash
bash setup_env.sh
```

Then run inference:

```bash
bash inference.sh
```
Infinity-RoPE utilizes a specific syntax to control temporal duration and scene transitions. Examples are provided in prompts/infinity_rope_prompts.txt. The core format for an action segment is:
```
"action_description[duration]"
```
| Operator | Name | Function |
|---|---|---|
| `[Ns]` | Duration | Sets the segment length in seconds (e.g., `[10s]`). |
| `\|` | Separator | Chains multiple action prompts together. |
| `#` | Scene Cut | When placed inside brackets (e.g., `[10s#]`), it triggers a hard cut. |
| `;` | Subtitle Toggle | Separates action prompts (left) from subtitle text (right). |
Generates one seamless video of the specified length.
```
"action_1_prompt[30s]"
```
- Total Length: 30s
- Result: A single 30-second continuous shot.
Transitions between different behaviors within a single, continuous camera shot.
```
"action_1_prompt[5s] | action_2_prompt[10s] | action_3_prompt[15s]"
```
- Total Length: 30s (5s + 10s + 15s)
- Result: The subject transitions naturally from action 1 to 2 to 3 without a camera break.
Forces the model to perform a hard jump-cut at the beginning of specific segments.
```
"action_1_prompt[10s] | action_2_prompt[10s#] | action_3_prompt[10s#]"
```
- Total Length: 30s
- Result: Three distinct 10-second scenes. The `#` at the start of actions 2 and 3 initiates the scene cuts.
Combines scene cuts with synchronized text overlays.
```
"action_1[10s] | action_2[10s#] | action_3[10s#] ; subtitle_1 | subtitle_2 | subtitle_3"
```
- Total Length: 30s
- Result: Three distinct 10-second scenes. Each segment displays its corresponding subtitle from the list provided after the `;`.
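As a rough illustration, the full syntax above can be handled by a short parser. This is a hedged sketch in plain Python; the repo's actual prompt handling may differ, and `parse_prompt` is a name of our choosing:

```python
import re

def parse_prompt(prompt: str):
    """Parse the prompt syntax into action segments.

    Illustrative sketch only; not the repo's real parser.
    """
    prompt = prompt.strip().strip('"')
    # ';' separates action prompts (left) from subtitle text (right).
    actions_part, _, subtitles_part = prompt.partition(";")
    subtitles = [s.strip() for s in subtitles_part.split("|")] if subtitles_part else []

    segments = []
    for i, chunk in enumerate(actions_part.split("|")):
        chunk = chunk.strip()
        # '[Ns]' sets the duration; a '#' inside the brackets marks a hard scene cut.
        m = re.search(r"\[(\d+)s(#?)\]$", chunk)
        if not m:
            raise ValueError(f"segment without an [Ns] duration tag: {chunk!r}")
        segments.append({
            "text": chunk[: m.start()].strip(),
            "duration_s": int(m.group(1)),
            "scene_cut": m.group(2) == "#",
            "subtitle": subtitles[i] if i < len(subtitles) else None,
        })
    return segments

segs = parse_prompt('"a1[10s] | a2[10s#] | a3[10s#] ; s1 | s2 | s3"')
total = sum(s["duration_s"] for s in segs)  # 30
```

For example, the full syntax prompt above yields three segments of 10s each, with `scene_cut` set on the second and third and one subtitle attached per segment.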
Note:
- As KV Flush is effectively an index-change operation, we found it quite useful to repeat the characteristics of the environment and people in every action prompt. See the examples in prompts/infinity_rope_prompts.txt.
- Our model works better with long, detailed prompts since it was trained on such prompts. We will integrate prompt extension into the codebase (similar to Wan2.1) in the future. For now, we recommend using third-party LLMs (such as GPT-4o) to extend your prompt before providing it to the model.
- You may want to adjust FPS so it plays smoothly on your device.
- The speed can be improved by enabling `torch.compile`, TAEHV-VAE, or FP8 Linear layers, although the latter two options may sacrifice quality. We recommend using `torch.compile` if possible and enabling TAEHV-VAE if further speedup is needed.
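For instance, the per-segment repetition suggested above (restating environment and character descriptions in every action prompt) can be automated with a small helper. This is a hypothetical sketch; `build_prompt` and its arguments are our names, not part of this repo:

```python
def build_prompt(scene: str, actions: list[tuple[str, str]]) -> str:
    """Prepend a shared scene/character description to every action segment.

    `scene` is repeated verbatim in each segment because KV Flush re-indexes
    the cache and benefits from restated context. `actions` pairs an action
    description with its duration tag (e.g. "10s" or "10s#" for a hard cut).
    Hypothetical helper, not part of the repo.
    """
    parts = [f"{scene} {text}[{tag}]" for text, tag in actions]
    return '"' + " | ".join(parts) + '"'

prompt = build_prompt(
    "A wizard in a candle-lit stone hall,",
    [("raises his wand", "10s"), ("casts a glowing spell", "10s#")],
)
```

This keeps the shared context in sync across segments instead of hand-copying it into each action prompt.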
```bash
huggingface-cli download gdhe17/Self-Forcing checkpoints/ode_init.pt --local-dir .
huggingface-cli download gdhe17/Self-Forcing vidprom_filtered_extended.txt --local-dir prompts
```
Note: Our training algorithm (except for the GAN version) is data-free (no video data is needed). For now, we directly provide the ODE initialization checkpoint and will add more instructions on how to perform ODE initialization in the future (which is identical to the process described in the CausVid repo).
```bash
torchrun --nnodes=8 --nproc_per_node=8 --rdzv_id=5235 \
    --rdzv_backend=c10d \
    --rdzv_endpoint $MASTER_ADDR \
    train.py \
    --config_path configs/self_forcing_dmd.yaml \
    --logdir logs/self_forcing_dmd \
    --disable-wandb
```
Our training run uses 600 iterations and completes in under 2 hours on 64 H100 GPUs. With gradient accumulation, it should be possible to reproduce the results in less than 16 hours on 8 H100 GPUs.
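The 16-hour estimate above is back-of-the-envelope GPU-hour arithmetic, assuming near-linear scaling with gradient accumulation:

```python
# Back-of-the-envelope scaling for the training run described above,
# assuming the total GPU-hour budget stays roughly constant when
# trading GPUs for gradient-accumulation steps.
gpu_hours = 64 * 2          # 64 H100s for under 2 hours ~= 128 GPU-hours
hours_on_8 = gpu_hours / 8  # same budget spread over 8 H100s
# hours_on_8 == 16.0, matching the "less than 16 hours" estimate
```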
This codebase is built on top of the open-source implementation of Self-Forcing. We also thank Infinite-Forcing for providing an attention-sink checkpoint, and Causal Forcing for providing a checkpoint with high dynamic degree and imaging quality.
If you find this codebase useful for your research, please kindly cite our paper:
```bibtex
@article{yesiltepe2025infinity,
  title={Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout},
  author={Yesiltepe, Hidir and Meral, Tuna Han Salih and Akan, Adil Kaan and Oktay, Kaan and Yanardag, Pinar},
  journal={arXiv preprint arXiv:2511.20649},
  year={2025}
}
```