# WorldCache: Content-Aware Caching for Accelerated Video World Models


Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan, Fahad Shahbaz Khan

WorldCache is a training-free, plug-and-play inference acceleration framework for diffusion-based video world models. It achieves up to 3.0× speedup while strictly maintaining temporal coherence and visual fidelity.


## 🎬 Qualitative Preview

High-fidelity video generation on Cosmos-Predict 2.5 (14B) with up to 3.0× speedup.

*(Teaser previews: Robot 154, AV Scene, Misc Scene, and Robot 034.)*


## 📖 Abstract

Video World Models increasingly rely on large-scale diffusion transformers to simulate complex spatial dynamics. However, the high computational cost of autoregressive generation remains a significant bottleneck. WorldCache overcomes this by identifying temporal and spatial redundancies in the denoising process.

> [!TIP]
> WorldCache is backbone-agnostic and training-free. It can be integrated into existing diffusion pipelines with just a few lines of code.

Unlike naive caching, which causes "motion drift," WorldCache combines four content-aware modules: Causal Feature Caching (CFC), Saliency-Weighted Drift (SWD), Optimal Feature Approximation (OFA), and Adaptive Threshold Scheduling (ATS). Together they predict skipped computation rather than blindly copying it. Our method generalizes across leading architectures such as NVIDIA Cosmos, WAN2.1, and DreamDojo.
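As a toy illustration of the training-free, plug-and-play idea, a caching wrapper around a single transformer-block forward might look like the sketch below. The class name, tolerance parameter, and mean-L1 drift metric are all hypothetical; they are not the repository's actual API.

```python
import numpy as np

class WorldCacheSketch:
    """Hypothetical plug-and-play caching wrapper (illustrative only).

    It reuses the cached block output whenever the block input has
    barely moved since the last real computation.
    """

    def __init__(self, block_fn, tau=0.05):
        self.block_fn = block_fn   # the expensive transformer-block forward
        self.tau = tau             # caching tolerance (assumed hyperparameter)
        self.cached_in = None
        self.cached_out = None

    def __call__(self, x):
        if self.cached_in is not None:
            drift = np.abs(x - self.cached_in).mean()  # mean-L1 input drift
            if drift < self.tau:                       # input barely moved: reuse
                return self.cached_out
        out = self.block_fn(x)                         # otherwise recompute and cache
        self.cached_in, self.cached_out = x, out
        return out
```

Wrapping each block this way leaves the pipeline otherwise untouched, which is what makes the approach training-free.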


## ✨ Key Components

WorldCache is driven by four key technical components:

| Module | Icon | Description |
|---|---|---|
| Causal Feature Caching (CFC) | ⚡ | Dynamically scales caching tolerance based on early-layer motion velocity. |
| Saliency-Weighted Drift (SWD) | 🎯 | Penalizes caching errors in perceptually critical high-frequency regions. |
| Optimal Feature Approximation (OFA) | 🌊 | Interpolates skipped cache states using trajectory matching and optical flow. |
| Adaptive Threshold Scheduling (ATS) | 📈 | Exponentially relaxes caching constraints in later denoising stages. |
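The four modules above can be caricatured as a single skip-or-compute decision: measure drift, scale the tolerance by motion, and relax it exponentially over the denoising trajectory. The sketch below is illustrative only; the names `base_tau`, `gamma`, and `motion_sensitivity` are assumptions, not the repository's actual parameters.

```python
import numpy as np

def ats_threshold(base_tau, step, total_steps, gamma=2.0):
    """Adaptive Threshold Scheduling (illustrative): exponentially relax the
    caching tolerance as denoising progresses. gamma is a hypothetical
    relaxation rate."""
    progress = step / max(total_steps - 1, 1)
    return base_tau * np.exp(gamma * progress)

def should_reuse_cache(drift, motion, step, total_steps,
                       base_tau=0.05, motion_sensitivity=2.0):
    """Reuse cached features only when measured drift stays under a
    motion-scaled, step-dependent tolerance (all names illustrative)."""
    tau = ats_threshold(base_tau, step, total_steps)
    tau = tau / (1.0 + motion_sensitivity * motion)  # tighten in fast scenes
    return drift < tau
```

This captures the intuition that early, structure-defining steps and high-motion scenes tolerate little caching, while late refinement steps tolerate much more.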

## 🔬 Method Overview

WorldCache treats caching as a localized prediction problem: it paces cache reuse with causal drift tracking while interpolating the next feature state.

WorldCache Pipeline

πŸ› οΈ Technical Highlights

  • Drift Probing: Uses the first $K$ blocks of the transformer as a lightweight proxy for global drift.
  • Motion-Adaptive Thresholds: Uses $\alpha$-scaled motion signals to prevent "ghosting" artifacts in high-dynamics scenes.
  • Saliency Mapping: Weights L1 drift by spatial saliency (channel-wise variance) to preserve fine details.
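The saliency-mapping highlight above (weighting L1 drift by channel-wise variance) can be sketched as follows. The tensor shapes and the normalization are assumptions for illustration, not the repository's exact formulation.

```python
import numpy as np

def saliency_weighted_drift(feat_prev, feat_curr, saliency_weight=1.0):
    """Illustrative Saliency-Weighted Drift: per-token L1 change between
    consecutive feature maps, weighted by channel-wise variance so drift in
    detail-rich (high-variance) tokens is penalized more."""
    # feat_*: (tokens, channels)
    saliency = feat_curr.var(axis=1)                 # channel-wise variance per token
    saliency = saliency / (saliency.mean() + 1e-8)   # normalize around 1
    l1 = np.abs(feat_curr - feat_prev).mean(axis=1)  # per-token L1 drift
    weights = np.clip(1.0 + saliency_weight * (saliency - 1.0), 0.0, None)
    return float((weights * l1).mean())
```

With `saliency_weight=0` this degenerates to plain mean-L1 drift; larger values bias the metric toward preserving fine details, mirroring the `--worldcache_saliency_weight` flag used in the commands below.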

βš™οΈ Installation & Setup

For detailed system requirements, environment setup (Virtual Env/Docker), and checkpoint downloading instructions, please refer to our:

👉 Detailed Setup Guide

### ⚡ Quick Summary (Conda + UV)

```bash
# 1. Create and activate conda environment
conda create -n worldcache python=3.10 -y
conda activate worldcache

# 2. Sync dependencies with UV
curl -LsSf https://astral.sh/uv/install.sh | sh
cd Models/Cosmos-Predict2.5/
uv sync --extra=cu128 --active --inexact
cd ../..  # return to the repository root

# 3. Basic inference (from the repository root)
python Models/Cosmos-Predict2.5/examples/inference.py --model 2B/post-trained --worldcache_enabled [options]
```

## 🚀 Quick Start

To generate high-quality video with WorldCache acceleration:

```bash
# From the root of the repository - Text2World
CUDA_VISIBLE_DEVICES=0 python Models/Cosmos-Predict2.5/examples/inference.py \
  -i Models/Cosmos-Predict2.5/path/to/prompt.json \
  -o outputs/worldcache_output \
  --inference-type=text2world \
  --model 2B/post-trained \
  --disable-guardrails \
  --worldcache_enabled \
  --worldcache_motion_sensitivity 2.0 \
  --worldcache_flow_enabled \
  --worldcache_flow_scale 2.0 \
  --worldcache_osi_enabled \
  --worldcache_saliency_enabled \
  --worldcache_saliency_weight 1.0 \
  --worldcache_dynamic_decay
```

```bash
# From the root of the repository - Image2World
CUDA_VISIBLE_DEVICES=0 python Models/Cosmos-Predict2.5/examples/inference.py \
  -i Models/Cosmos-Predict2.5/path/to/prompt.json \
  -o outputs/worldcache_output \
  --inference-type=image2world \
  --model 2B/post-trained \
  --disable-guardrails \
  --worldcache_enabled \
  --worldcache_motion_sensitivity 2.0 \
  --worldcache_flow_enabled \
  --worldcache_flow_scale 2.0 \
  --worldcache_osi_enabled \
  --worldcache_saliency_enabled \
  --worldcache_saliency_weight 1.0 \
  --worldcache_dynamic_decay
```

## 📊 Quantitative Results

WorldCache establishes a new state-of-the-art for training-free diffusion acceleration, maintaining near-baseline quality while significantly reducing latency.


## 🌐 Model & Benchmark Coverage

| Model Family | Scales | Architecture | PAI-Bench | EgoDex-Eval |
|---|---|---|---|---|
| Cosmos-Predict 2.5 | 2B, 14B | DiT | ✅ | ✅ |
| WAN2.1 | 1.3B, 14B | DiT | ✅ | ✅ |
| DreamDojo | 2B | DiT | — | ✅ |

### 1. PAI-Bench: Physical Reasoning Benchmarks

Across two major architectures (Cosmos and WAN), WorldCache consistently delivers >2× speedup with <1% drop in overall physical reasoning scores.

**Table 1: PAI-Bench Text-to-World (T2W) Results**

| Model | Method | Domain Avg | Quality Avg | Overall | Latency (s) | Speedup |
|---|---|---|---|---|---|---|
| Cosmos 2B | Baseline | 0.767 | 0.728 | 0.748 | 54.34 | 1.00× |
| | DiCache | 0.759 | 0.727 | 0.743 | 40.82 | 1.3× |
| | WorldCache | 0.763 | 0.727 | 0.745 | 26.28 | 2.1× |
| Cosmos 14B | Baseline | 0.792 | 0.746 | 0.769 | 216.25 | 1.00× |
| | DiCache | 0.792 | 0.745 | 0.768 | 148.36 | 1.4× |
| | WorldCache | 0.795 | 0.746 | 0.771 | 98.61 | 2.14× |

**Table 2: PAI-Bench Image-to-World (I2W) Results**

| Model | Method | Domain Avg | Quality Avg | Overall | Latency (s) | Speedup |
|---|---|---|---|---|---|---|
| Cosmos 2B | Baseline | 0.845 | 0.761 | 0.803 | 55.04 | 1.00× |
| | DiCache | 0.835 | 0.752 | 0.794 | 39.68 | 1.40× |
| | WorldCache | 0.840 | 0.756 | 0.798 | 24.48 | 2.30× |
| Cosmos 14B | Baseline | 0.860 | 0.769 | 0.814 | 210.07 | 1.00× |
| | DiCache | 0.855 | 0.767 | 0.811 | 146.04 | 1.40× |
| | WorldCache | 0.859 | 0.768 | 0.813 | 99.25 | 2.18× |

### 2. Architecture Transfer: WAN2.1

WorldCache is backbone-agnostic. On the latest WAN2.1 architecture, it achieves superior speed-quality tradeoffs compared to DiCache.

**Table 3: WAN2.1 Transfer Results**

| Backbone | Method | Overall | Latency (s) | Speedup |
|---|---|---|---|---|
| T2W 1.3B | Baseline | 0.7727 | 120.04 | 1.00× |
| | DiCache | 0.7703 | 61.57 | 1.96× |
| | WorldCache | 0.7721 | 50.84 | 2.36× |
| I2W 14B | Baseline | 0.7384 | 475.60 | 1.00× |
| | DiCache | 0.7311 | 291.91 | 1.53× |
| | WorldCache | 0.7388 | 206.73 | 2.31× |

### 3. EgoDex-Eval: Robotics Performance

In egocentric robotics tasks requiring high spatial precision, WorldCache maintains frame-level fidelity (PSNR/SSIM) while enabling real-time-friendly inference.

**Table 4: EgoDex-Eval (Robotics Evaluation)**

| Model | Method | PSNR | SSIM | LPIPS | Latency (s) | Speedup |
|---|---|---|---|---|---|---|
| WAN2.1-14B | Baseline | 13.30 | 0.503 | 0.459 | 391.90 | 1.00× |
| | DiCache | 12.95 | 0.491 | 0.461 | 208.60 | 1.88× |
| | WorldCache | 13.19 | 0.498 | 0.460 | 171.60 | 2.30× |
| Cosmos-2.5-2B | Baseline | 12.87 | 0.455 | 0.518 | 70.01 | 1.00× |
| | DiCache | 12.63 | 0.445 | 0.531 | 51.97 | 1.34× |
| | WorldCache | 12.82 | 0.466 | 0.518 | 43.24 | 1.62× |
| DreamDojo-2B | Baseline | 23.63 | 0.775 | 0.226 | 19.73 | 1.00× |
| | DiCache | 20.41 | 0.734 | 0.252 | 12.46 | 1.58× |
| | WorldCache | 23.69 | 0.737 | 0.251 | 10.36 | 1.90× |

## 📈 Denoising Step Budget Scaling

WorldCache scales effectively with the denoising step budget. For longer generation trajectories (more steps), the efficiency gains increase as the underlying motion manifold stabilizes.

Denoising Step Budget Scaling
Efficiency scaling: WorldCache achieves up to 3.1× speedup as denoising steps increase, while maintaining superior quality over DiCache.


πŸ–ΌοΈ Qualitative Gallery

WorldCache maintains flawless temporal coherence across diverse domains, from urban traffic to precision robotics.

Main Qualitative Comparison Main comparison: WorldCache stays closer to baseline rollout in dynamic and interaction-heavy regions.

Cosmos-2B Crossing
Cosmos-2B crossing scene: Preserves pedestrian identity and background consistency.
Cosmos-14B Kitchen
Cosmos-14B kitchen interaction: Stable hand and carried object tracking.
Dynamic Scene
Dynamic scene: Balanced performance in high-velocity regions.
Temporal Consistency
Consistency: Zero "ghosting" artifacts at 2.5×+ speedup.

πŸ™ Acknowledgements

We acknowledge the following works that inspired this project:

  • Cosmos-Predict 2.5 β€” NVIDIA's world foundation model platform.
  • WAN2.1 β€” Open suite of video foundation models.
  • DreamDojo β€” Generalist robot world model from the NVIDIA GEAR Team.
  • DiCache β€” DiCache: Let Diffusion Model Determine Its Own Cache, ICLR 2026.

πŸ“ Citation

@inproceedings{nawaz2026worldcache,
  title     = {WorldCache: Content-Aware Caching for Accelerated Video World Models},
  author    = {Umair Nawaz and Ahmed Heakl and Ufaq Khan and Abdelrahman Shaker and Salman Khan and Fahad Shahbaz Khan},
  journal   = {arXiv preprint arXiv:2603.22286},
  year      = {2026}
}

## 📄 License

This project inherits the Apache 2.0 License from the NVIDIA Cosmos-Predict2 codebase.
