Gabrijel Boduljak | Yushi Lan | Christian Rupprecht | Andrea Vedaldi
Many recent methods forecast the world by generating stochastic videos. While these excel at visual realism, pixel prediction is computationally expensive and requires translating RGB into actionable signals for decision-making. An alternative uses vision foundation model (VFM) features as world representations, performing deterministic regression to predict future states. These features directly translate into useful signals like semantic segmentation and depth while remaining efficient. However, deterministic regression averages over multiple plausible futures, failing to capture uncertainty and reducing accuracy. To address this limitation, we introduce a generative forecaster using autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. This latent space preserves information more effectively than PCA-based alternatives for both forecasting and other applications like image generation. Our latent predictions decode easily into multiple interpretable modalities: semantic segmentation, depth, surface normals, and RGB. With matched architecture and compute, our method produces sharper, more accurate predictions than regression across all modalities and improves appearance prediction. Our results suggest that stochastic conditional generation of VFM features offers a promising, scalable foundation for future world models.
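The claim that deterministic regression averages over multiple plausible futures can be seen in a toy sketch. It is not the VFMF model or its data: it assumes a one-dimensional "future" that is equally likely to be -1 or +1, and every name in it is hypothetical. The prediction that minimizes squared error lands near 0, which is neither possible future, whereas a sample is always one of them.

```python
import random

random.seed(0)

# Toy assumption: the future is a single number, equally likely to be -1.0 or +1.0.
futures = [random.choice([-1.0, 1.0]) for _ in range(10_000)]


def mse(pred):
    """Mean squared error of a single constant prediction against all futures."""
    return sum((f - pred) ** 2 for f in futures) / len(futures)


# The constant that minimizes mean squared error is the sample mean.
regression_prediction = sum(futures) / len(futures)

# A sampler returns one plausible future; here it simply draws from the
# toy data as a stand-in for a learned generative model.
sampled_prediction = random.choice(futures)

print("regression prediction:", regression_prediction)
print("sampled prediction:   ", sampled_prediction)
print("MSE of regression:    ", mse(regression_prediction))
print("MSE of predicting +1: ", mse(1.0))
```

The regression output has the lowest squared error but matches no real outcome, which is the blurring effect that motivates sampling instead.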
An overview of our method VFMF. Given RGB context frames
- Clone this repository.
- Set up an environment matching the specification in `environment.yml`.
- Download the checkpoints.
- Open a demo notebook. Examples are `world-model/cityscapes_demo.ipynb` and `world-model/kubric_demo.ipynb`.
- Fix the paths in the first notebook cell:

      REPO_PATH = "{absolute path to the repository}"
      CKPTS_PATH = "{absolute path to the checkpoints folder}"

We released Kubric and Cityscapes checkpoints and demo inference notebooks.
More code and instructions will be released soon.
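Before running a notebook, it may help to confirm that the two paths are set correctly. The helper below is a minimal sketch and is not part of this repository: `check_paths` and `DEMO_NOTEBOOKS` are hypothetical names. It assumes only what is stated above, namely that both values are absolute paths and that the demo notebooks live under `world-model/` in the repository. It does not check the contents of the checkpoints folder, since that layout is not described here.

```python
from pathlib import Path

# Demo notebooks named in this README, relative to the repository root.
DEMO_NOTEBOOKS = (
    "world-model/cityscapes_demo.ipynb",
    "world-model/kubric_demo.ipynb",
)


def check_paths(repo_path, ckpts_path):
    """Return a list of problems found; an empty list means the paths look usable."""
    problems = []
    repo, ckpts = Path(repo_path), Path(ckpts_path)

    # Both values are expected to be absolute paths to existing directories.
    for name, path in (("REPO_PATH", repo), ("CKPTS_PATH", ckpts)):
        if not path.is_absolute():
            problems.append(f"{name} is not an absolute path: {path}")
        elif not path.is_dir():
            problems.append(f"{name} is not an existing directory: {path}")

    # Only look for the notebooks when the repository path itself is valid.
    if repo.is_absolute() and repo.is_dir():
        for rel in DEMO_NOTEBOOKS:
            if not (repo / rel).is_file():
                problems.append(f"missing demo notebook: {rel}")

    return problems
```

Calling `check_paths(REPO_PATH, CKPTS_PATH)` in the first notebook cell and printing the result would surface a mistyped path before any model code runs.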
This repository is based on: