
VFMF: World Modeling by Forecasting Vision Foundation Model Features

Gabrijel Boduljak | Yushi Lan | Christian Rupprecht | Andrea Vedaldi

Abstract

Many recent methods forecast the world by generating stochastic videos. While these excel at visual realism, pixel prediction is computationally expensive and requires translating RGB into actionable signals for decision-making. An alternative uses vision foundation model (VFM) features as world representations, performing deterministic regression to predict future states. These features directly translate into useful signals like semantic segmentation and depth while remaining efficient. However, deterministic regression averages over multiple plausible futures, failing to capture uncertainty and reducing accuracy. To address this limitation, we introduce a generative forecaster using autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. This latent space preserves information more effectively than PCA-based alternatives for both forecasting and other applications like image generation. Our latent predictions decode easily into multiple interpretable modalities: semantic segmentation, depth, surface normals, and RGB. With matched architecture and compute, our method produces sharper, more accurate predictions than regression across all modalities and improves appearance prediction. Our results suggest that stochastic conditional generation of VFM features offers a promising, scalable foundation for future world models.

Method

An overview of our method, VFMF. Given RGB context frames $\mathbf{I}_1,\dots,\mathbf{I}_t$, we extract DINO features $\mathbf{f}_1,\dots,\mathbf{f}_t$ and predict the next-state feature $\mathbf{f}_{t+1}$. Context features are compressed with a VAE along the channel dimension to produce context latents $\mathbf{z}_1,\dots,\mathbf{z}_t$. These context latents are concatenated with noisy future latents $\mathbf{z}_{t+1}$ and passed to a conditional denoiser that denoises only the future latents $\mathbf{z}_{t+1}$ while leaving the context latents unchanged. This process repeats autoregressively over a fixed-length context window: each time a new latent $\mathbf{z}_{t+1}$ is generated, it is appended to the context and the oldest context latent is dropped. The denoised future latents are decoded back to DINO feature space by the VAE decoder. Finally, the reconstructed features can be routed to task-specific modality decoders for downstream tasks or interpretation.
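The sketch below illustrates this autoregressive rollout. It is a minimal illustration, not the released implementation: the interfaces `dino`, `vae`, and `denoiser` are hypothetical placeholders, and sampling is shown as simple Euler integration of a learned velocity field. The demo notebooks define the actual models and sampling code.

```python
# Minimal sketch of the autoregressive latent rollout, assuming hypothetical
# interfaces (`dino`, `vae`, `denoiser`); see the demo notebooks for the
# actual implementation.
import torch


@torch.no_grad()
def rollout(frames, dino, vae, denoiser, n_future=5, n_steps=20, window=4):
    """Forecast `n_future` latent states from a list of RGB context frames."""
    # Encode the RGB context into DINO features, then compress each feature
    # map into a latent with the VAE encoder (channel-wise compression).
    feats = [dino(frame) for frame in frames]
    context = [vae.encode(f) for f in feats][-window:]  # fixed-length window

    predictions = []
    for _ in range(n_future):
        # Start the future latent z_{t+1} from Gaussian noise.
        z = torch.randn_like(context[-1])

        # Flow matching sampling: integrate the learned velocity field from
        # noise (tau = 0) to data (tau = 1) with plain Euler steps. Only the
        # future latent is updated; the context latents stay clean.
        for i in range(n_steps):
            tau = torch.tensor(i / n_steps)
            v = denoiser(context=torch.stack(context), noisy_future=z, tau=tau)
            z = z + v / n_steps

        predictions.append(z)
        # Slide the window: append the new latent, drop the oldest one.
        context = context[1:] + [z]

    # Decode the predicted latents back to DINO feature space; these features
    # can then be routed to task-specific modality decoders.
    return [vae.decode(z) for z in predictions]
```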

Instructions

Inference

  1. Clone this repository.
  2. Set up an environment matching the specification in environment.yml.
  3. Download the checkpoints.
  4. Open a demo notebook. Examples are world-model/cityscapes_demo.ipynb and world-model/kubric_demo.ipynb.
  5. Fix the paths in the first notebook cell (see the example below):
REPO_PATH = "{absolute path to the repository}" 
CKPTS_PATH = "{absolute path to the checkpoints folder}"
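For example, after editing, the first cell might look like the following. The paths are placeholders for your local setup, and the `sys.path` line is only an illustrative assumption about how the repository modules are made importable in the notebook.

```python
import sys

# Placeholder paths; replace with your local clone and checkpoints folder.
REPO_PATH = "/home/user/vfmf"
CKPTS_PATH = "/home/user/vfmf-checkpoints"

# Illustrative assumption: make the repository importable from the notebook.
sys.path.insert(0, REPO_PATH)
```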

We have released Kubric and Cityscapes checkpoints and demo inference notebooks.

More code and instructions will be released soon.

Acknowledgements

This repository is based on:
