Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan, Fahad Shahbaz Khan
WorldCache is a training-free, plug-and-play inference acceleration framework for diffusion-based video world models. It achieves up to a 3.0× speedup while preserving temporal coherence and visual fidelity.
High-fidelity video generation on Cosmos-Predict 2.5 (14B) with up to 3.0× speedup.
Video World Models increasingly rely on large-scale diffusion transformers to simulate complex spatial dynamics. However, the high computational cost of autoregressive generation remains a significant bottleneck. WorldCache overcomes this by identifying temporal and spatial redundancies in the denoising process.
> [!TIP]
> WorldCache is backbone-agnostic and training-free. It can be integrated into existing diffusion pipelines with just a few lines of code.
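To make the plug-and-play idea concrete, here is a minimal, hypothetical sketch of what wrapping a single transformer block in a drift-gated cache could look like. The `BlockCache` class, the relative-L1 drift metric, and the threshold value are illustrative assumptions for exposition, not the actual WorldCache API.

```python
import numpy as np

class BlockCache:
    """Hypothetical sketch: reuse a block's output when its input has
    drifted only slightly since the last full computation."""

    def __init__(self, block_fn, threshold=0.05):
        self.block_fn = block_fn    # the expensive transformer block
        self.threshold = threshold  # relative L1 drift tolerance (assumed)
        self.cached_input = None
        self.cached_output = None
        self.hits = 0

    def __call__(self, x):
        if self.cached_input is not None:
            # Relative L1 drift between the current and cached inputs.
            drift = np.abs(x - self.cached_input).mean() / (
                np.abs(self.cached_input).mean() + 1e-8)
            if drift < self.threshold:
                self.hits += 1
                return self.cached_output  # skip the expensive call
        self.cached_input = x.copy()
        self.cached_output = self.block_fn(x)
        return self.cached_output

# Toy usage: stand in a cheap nonlinearity for a real block.
cache = BlockCache(lambda x: np.tanh(x), threshold=0.05)
x = np.ones((4, 8))
y1 = cache(x)         # full computation, cache populated
y2 = cache(x + 1e-4)  # tiny drift -> cached output reused
print(cache.hits)     # 1
```

In a real pipeline the wrapped function would be a DiT block and the threshold would come from the motion-adaptive schedule described below, but the control flow is the same.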
Unlike naive caching, which causes "motion drift," WorldCache uses a suite of content-aware modules: Causal Feature Caching (CFC), Saliency-Weighted Drift (SWD), Optimal Feature Approximation (OFA), and Adaptive Threshold Scheduling (ATS). Together they predict skipped computations rather than blindly copying them. Our method generalizes across leading architectures, including NVIDIA Cosmos, WAN2.1, and DreamDojo.
WorldCache is driven by four key technical ideas:
| Module | Description |
|---|---|
| Causal Feature Caching (CFC) | Dynamically scales caching tolerance based on early-layer motion velocity. |
| Saliency-Weighted Drift (SWD) | Penalizes caching errors in perceptually critical high-frequency regions. |
| Optimal Feature Approximation (OFA) | Interpolates skipped cache states using trajectory matching and optical flow. |
| Adaptive Threshold Scheduling (ATS) | Exponentially relaxes caching constraints in later denoising stages. |
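The ATS idea above can be sketched as a simple exponential schedule: early denoising steps, which set global structure, get a tight tolerance, and later refinement steps get a relaxed one. The formula and the `tau0`/`gamma` parameters here are illustrative assumptions, not the published schedule.

```python
import math

def ats_threshold(step, total_steps, tau0=0.02, gamma=1.5):
    """Illustrative exponential relaxation: later denoising steps change
    features slowly, so the caching tolerance can safely grow.
    tau0 (initial tolerance) and gamma (growth rate) are hypothetical."""
    progress = step / max(total_steps - 1, 1)  # 0.0 -> 1.0 over the run
    return tau0 * math.exp(gamma * progress)

# Tolerance grows monotonically over a 50-step denoising trajectory.
schedule = [ats_threshold(t, 50) for t in range(50)]
print(round(schedule[0], 4), round(schedule[-1], 4))  # 0.02 0.0896
```

A larger final threshold means more cache hits late in the trajectory, which is where most of the speedup in the scaling results below would come from under this schedule.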
WorldCache treats caching as localized prediction: it paces reuse with causal tracking while interpolating the next state.

- **Drift Probing:** uses the first $K$ blocks of the transformer as a lightweight proxy for global drift.
- **Motion-Adaptive Thresholds:** uses $\alpha$-scaled motion signals to prevent "ghosting" artifacts in highly dynamic scenes.
- **Saliency Mapping:** weights L1 drift by spatial saliency (channel-wise variance) to preserve fine details.
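The saliency-weighted drift test described above can be sketched in a few lines of numpy, assuming channel-wise variance as the saliency proxy and a motion-scaled threshold. The function name, the exact weighting, and the default parameters are assumptions for illustration.

```python
import numpy as np

def saliency_weighted_drift(feat, cached, alpha=2.0, motion=0.0, base_tau=0.05):
    """Return (drift, should_recompute) for (C, H, W) feature maps.
    Saliency = per-channel variance (a proxy for high-frequency detail);
    the alpha-scaled motion term tightens the threshold in dynamic scenes."""
    saliency = feat.var(axis=(1, 2))               # (C,) detail proxy
    saliency = saliency / (saliency.sum() + 1e-8)  # normalize to weights
    l1 = np.abs(feat - cached).mean(axis=(1, 2))   # per-channel L1 drift
    drift = float((saliency * l1).sum())           # saliency-weighted drift
    tau = base_tau / (1.0 + alpha * motion)        # motion-adaptive threshold
    return drift, drift > tau

rng = np.random.default_rng(0)
f = rng.standard_normal((4, 16, 16))
# Identical features -> zero drift -> safe to reuse the cache.
print(saliency_weighted_drift(f, f.copy()))  # (0.0, False)
```

With `motion > 0` the threshold shrinks, so the same drift is more likely to trigger recomputation, which is the intended behavior in fast-moving scenes.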
For detailed system requirements, environment setup (virtual env/Docker), and checkpoint download instructions, please refer to our Detailed Setup Guide.
```bash
# 1. Create and activate conda environment
conda create -n worldcache python=3.10 -y
conda activate worldcache

# 2. Sync dependencies with UV
curl -LsSf https://astral.sh/uv/install.sh | sh
cd Models/Cosmos-Predict2.5/
uv sync --extra=cu128 --active --inexact

# 3. Basic inference (from root)
python Models/Cosmos-Predict2.5/examples/inference.py --model 2B/post-trained --worldcache_enabled [options]
```

To generate high-quality video with WorldCache acceleration:
```bash
# From the root of the repository - Text2World
CUDA_VISIBLE_DEVICES=0 python Models/Cosmos-Predict2.5/examples/inference.py \
  -i Models/Cosmos-Predict2.5/path/to/prompt.json \
  -o outputs/worldcache_output \
  --inference-type=text2world \
  --model 2B/post-trained \
  --disable-guardrails \
  --worldcache_enabled \
  --worldcache_motion_sensitivity 2.0 \
  --worldcache_flow_enabled \
  --worldcache_flow_scale 2.0 \
  --worldcache_osi_enabled \
  --worldcache_saliency_enabled \
  --worldcache_saliency_weight 1.0 \
  --worldcache_dynamic_decay
```
```bash
# From the root of the repository - Image2World
CUDA_VISIBLE_DEVICES=0 python Models/Cosmos-Predict2.5/examples/inference.py \
  -i Models/Cosmos-Predict2.5/path/to/prompt.json \
  -o outputs/worldcache_output \
  --inference-type=image2world \
  --model 2B/post-trained \
  --disable-guardrails \
  --worldcache_enabled \
  --worldcache_motion_sensitivity 2.0 \
  --worldcache_flow_enabled \
  --worldcache_flow_scale 2.0 \
  --worldcache_osi_enabled \
  --worldcache_saliency_enabled \
  --worldcache_saliency_weight 1.0 \
  --worldcache_dynamic_decay
```

WorldCache establishes a new state of the art for training-free diffusion acceleration, maintaining near-baseline quality while significantly reducing latency.
| Model Family | Scales | Architecture | PAI-Bench | EgoDex-Eval |
|---|---|---|---|---|
| Cosmos-Predict 2.5 | 2B, 14B | DiT | ✅ | ✅ |
| WAN2.1 | 1.3B, 14B | DiT | ✅ | ✅ |
| DreamDojo | 2B | DiT | ✅ | ✅ |
Across two major architectures (Cosmos and WAN), WorldCache consistently delivers >2× speedup with <1% drop in overall physical reasoning scores.
Table 1: PAI-Bench Text-to-World (T2W) Results
| Model | Method | Domain Avg | Quality Avg | Overall | Latency (s) | Speedup |
|---|---|---|---|---|---|---|
| Cosmos 2B | Baseline | 0.767 | 0.728 | 0.748 | 54.34 | 1.00× |
| | DiCache | 0.759 | 0.727 | 0.743 | 40.82 | 1.30× |
| | WorldCache | 0.763 | 0.727 | 0.745 | 26.28 | 2.10× |
| Cosmos 14B | Baseline | 0.792 | 0.746 | 0.769 | 216.25 | 1.00× |
| | DiCache | 0.792 | 0.745 | 0.768 | 148.36 | 1.40× |
| | WorldCache | 0.795 | 0.746 | 0.771 | 98.61 | 2.14× |
Table 2: PAI-Bench Image-to-World (I2W) Results
| Model | Method | Domain Avg | Quality Avg | Overall | Latency (s) | Speedup |
|---|---|---|---|---|---|---|
| Cosmos 2B | Baseline | 0.845 | 0.761 | 0.803 | 55.04 | 1.00× |
| | DiCache | 0.835 | 0.752 | 0.794 | 39.68 | 1.40× |
| | WorldCache | 0.840 | 0.756 | 0.798 | 24.48 | 2.30× |
| Cosmos 14B | Baseline | 0.860 | 0.769 | 0.814 | 210.07 | 1.00× |
| | DiCache | 0.855 | 0.767 | 0.811 | 146.04 | 1.40× |
| | WorldCache | 0.859 | 0.768 | 0.813 | 99.25 | 2.18× |
WorldCache is backbone-agnostic. On the latest WAN2.1 architecture, it achieves superior speed-quality tradeoffs compared to DiCache.
Table 3: WAN2.1 Transfer Results
| Backbone | Method | Overall | Latency (s) | Speedup |
|---|---|---|---|---|
| T2W 1.3B | Baseline | 0.7727 | 120.04 | 1.00× |
| | DiCache | 0.7703 | 61.57 | 1.96× |
| | WorldCache | 0.7721 | 50.84 | 2.36× |
| I2W 14B | Baseline | 0.7384 | 475.60 | 1.00× |
| | DiCache | 0.7311 | 291.91 | 1.53× |
| | WorldCache | 0.7388 | 206.73 | 2.31× |
In egocentric robotics tasks requiring high spatial precision, WorldCache maintains frame-level fidelity (PSNR/SSIM) while enabling real-time-friendly inference.
Table 4: EgoDex-Eval (Robotics Evaluation)
| Model | Method | PSNR | SSIM | LPIPS | Latency (s) | Speedup |
|---|---|---|---|---|---|---|
| WAN2.1-14B | Baseline | 13.30 | 0.503 | 0.459 | 391.90 | 1.00× |
| | DiCache | 12.95 | 0.491 | 0.461 | 208.60 | 1.88× |
| | WorldCache | 13.19 | 0.498 | 0.460 | 171.60 | 2.30× |
| Cosmos-2.5-2B | Baseline | 12.87 | 0.455 | 0.518 | 70.01 | 1.00× |
| | DiCache | 12.63 | 0.445 | 0.531 | 51.97 | 1.34× |
| | WorldCache | 12.82 | 0.466 | 0.518 | 43.24 | 1.62× |
| DreamDojo-2B | Baseline | 23.63 | 0.775 | 0.226 | 19.73 | 1.00× |
| | DiCache | 20.41 | 0.734 | 0.252 | 12.46 | 1.58× |
| | WorldCache | 23.69 | 0.737 | 0.251 | 10.36 | 1.90× |
WorldCache scales effectively with the denoising step budget. For longer generation trajectories (more steps), the efficiency gains increase as the underlying motion manifold stabilizes.
Efficiency scaling: WorldCache achieves up to 3.1× speedup as denoising steps increase, while maintaining superior quality over DiCache.
WorldCache maintains consistent temporal coherence across diverse domains, from urban traffic to precision robotics.
Main comparison: WorldCache stays closer to baseline rollout in dynamic and interaction-heavy regions.
We acknowledge the following works that inspired this project:
- Cosmos-Predict 2.5 – NVIDIA's world foundation model platform.
- WAN2.1 – Open suite of video foundation models.
- DreamDojo – Generalist robot world model from the NVIDIA GEAR Team.
- DiCache – DiCache: Let Diffusion Model Determine Its Own Cache, ICLR 2026.
```bibtex
@article{nawaz2026worldcache,
  title   = {WorldCache: Content-Aware Caching for Accelerated Video World Models},
  author  = {Umair Nawaz and Ahmed Heakl and Ufaq Khan and Abdelrahman Shaker and Salman Khan and Fahad Shahbaz Khan},
  journal = {arXiv preprint arXiv:2603.22286},
  year    = {2026}
}
```

This project inherits the Apache 2.0 License from the NVIDIA Cosmos-Predict2 codebase.