WorldGym: World Model as An Environment for Policy Evaluation [paper] [website] [demo]

Julian Quevedo¹, Ansh Kumar Sharma², Yixiang Sun², Varad Suryavanshi², Percy Liang¹, Sherry Yang¹,²,³

Stanford University¹ New York University² Google DeepMind³

Overview

This repository contains the evaluation harness used in "Evaluating Robot Policies in a World Model". It bundles:

the pretrained diffusion world model,
policy-specific runners for OpenVLA, Octo, SpatialVLA, and RT-1-X, and
utilities for dataset conversion and automatic VLM scoring.

Installation

Install the package in editable mode (optionally with extras for specific policies):

pip install -e ".[openvla,spatialvla,octo,rt1]"

Extras are additive; omit the ones you do not need. Some stacks need additional one-off steps:

Octo:
1. Install the dlimp library: pip install git+https://github.com/kvablack/dlimp@5edaa4691567873d495633f2708982b42edf1972 --no-deps
2. Edit the installed Octo package (typically under your Python site-packages) and update octo/utils/typing.py so that it defines PRNGKey = jax.random.PRNGKey.
RT-1-X: Obtain the official JAX checkpoint from the Open X-Embodiment release.
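The Octo edit in step 2 can also be scripted. The following is a minimal sketch, not part of this repository: patch_typing_source is a hypothetical helper that rewrites any top-level PRNGKey assignment in the text of octo/utils/typing.py, and it assumes the file already imports jax.

```python
import re

# Matches a top-level assignment to PRNGKey, whatever its current right-hand side.
_PRNGKEY_ASSIGNMENT = re.compile(r"^PRNGKey\s*=.*$", flags=re.MULTILINE)
REPLACEMENT = "PRNGKey = jax.random.PRNGKey"


def patch_typing_source(source: str) -> str:
    """Return `source` with its top-level PRNGKey alias set to jax.random.PRNGKey.

    Hypothetical helper: it only transforms text and assumes the module
    being patched already contains `import jax`.
    """
    patched, count = _PRNGKEY_ASSIGNMENT.subn(REPLACEMENT, source)
    if count == 0:
        # No existing alias was found: append one instead of silently doing nothing.
        patched = source.rstrip("\n") + "\n" + REPLACEMENT + "\n"
    return patched
```

To apply it, read the installed octo/utils/typing.py, pass its contents through patch_typing_source, and write the result back. Keeping a backup copy of the original file is a sensible precaution.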

World-model checkpoint

The evaluation runners require a diffusion world-model checkpoint, e.g. mixed_openx_9robots_20frames_0p1actiondropout_580ksteps.pt (≈9 GB). The checkpoint is hosted on Google Drive and can be downloaded with gdown:

pip install gdown
gdown 1uiRP2BuavapMsyP9Cbr25mi_ymk9SEJb
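Because the file is large, an interrupted download is easy to miss. Below is a minimal sketch of a sanity check. The helper name is hypothetical, and the 8 GB lower bound is an assumption chosen only because the checkpoint is described as ≈9 GB.

```python
from pathlib import Path


def checkpoint_looks_complete(path, min_bytes=8 * 10**9):
    """Return True if `path` is a file of at least `min_bytes` bytes.

    The default threshold is an assumed lower bound, not an exact
    published size, so adjust it if your checkpoint differs.
    """
    checkpoint = Path(path).expanduser()  # allow "~/checkpoints/..." style paths
    return checkpoint.is_file() and checkpoint.stat().st_size >= min_bytes
```

A False result suggests re-running the download before launching any runner.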

Prepare evaluation data

Point every runner's --root-dir flag (--root_dir for the RT-1 runner) at the directory whose subfolders contain *.png + metadata pairs. The helper discover_trials recursively discovers tasks from that root.
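As an illustration of the expected layout, the sketch below walks a root directory and pairs each PNG with a metadata file. It is not the repository's discover_trials implementation: the function name is hypothetical, and it assumes the metadata is a JSON file sharing the image's stem, which may differ from the actual format.

```python
from pathlib import Path


def find_png_metadata_pairs(root_dir, metadata_suffix=".json"):
    """Recursively collect (image, metadata) path pairs under `root_dir`.

    Assumption: each metadata file sits next to its PNG and differs only
    in the file extension. PNGs without such a sibling are skipped.
    """
    pairs = []
    for png_path in sorted(Path(root_dir).rglob("*.png")):
        metadata_path = png_path.with_suffix(metadata_suffix)
        if metadata_path.is_file():
            pairs.append((png_path, metadata_path))
    return pairs
```

Running it on the directory you intend to pass to a runner is a quick way to spot images that are missing their metadata before starting an evaluation.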

OpenVLA

world-model-eval-openvla \
  --root-dir /path/to/tasks \
  --checkpoint-path ~/checkpoints/world-model/mixed_openx_9robots_20frames_0p1actiondropout_580ksteps.pt \
  --model-name openvla-7b \
  --save-video --video-out-dir ./rollouts/openvla

SpatialVLA

world-model-eval-spatialvla \
  --root-dir /path/to/tasks \
  --checkpoint-path ~/checkpoints/world-model/mixed_openx_9robots_20frames_0p1actiondropout_580ksteps.pt \
  --model-name spatialvla-4b-224-pt

Octo

world-model-eval-octo \
  --root-dir /path/to/tasks \
  --checkpoint-path ~/checkpoints/world-model/mixed_openx_9robots_20frames_0p1actiondropout_580ksteps.pt \
  --model-name octo-base-1.5

RT-1-X

The RT-1 runner uses Abseil flags:

world-model-eval-rt1 \
  --root_dir /path/to/tasks \
  --checkpoint_path /path/to/rt1x_checkpoint \
  --world_model_checkpoint ~/checkpoints/world-model/mixed_openx_9robots_20frames_0p1actiondropout_580ksteps.pt

To save MP4 rollouts, pass the --save_video / --video_out_dir counterparts of the OpenVLA flags, where the runner supports them.
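The four runners differ mainly in flag spelling: three use hyphenated flags, while the RT-1 runner uses underscore-style Abseil flags and takes the policy checkpoint and the world-model checkpoint separately. The sketch below assembles the argument lists from the commands shown in this section. build_runner_command is a hypothetical helper, not part of the package, and the model names are simply the ones used in the examples.

```python
import shlex

# Executable and example model name for each hyphen-flag runner,
# copied from the commands in this README.
HYPHEN_RUNNERS = {
    "openvla": ("world-model-eval-openvla", "openvla-7b"),
    "spatialvla": ("world-model-eval-spatialvla", "spatialvla-4b-224-pt"),
    "octo": ("world-model-eval-octo", "octo-base-1.5"),
}


def build_runner_command(policy, root_dir, world_model_ckpt, rt1_ckpt=None):
    """Return the argument list for one evaluation runner (hypothetical helper)."""
    if policy == "rt1":
        if rt1_ckpt is None:
            raise ValueError("the RT-1 runner also needs the RT-1-X policy checkpoint")
        # Abseil flags: underscores, and a separate world-model checkpoint flag.
        return [
            "world-model-eval-rt1",
            "--root_dir", root_dir,
            "--checkpoint_path", rt1_ckpt,
            "--world_model_checkpoint", world_model_ckpt,
        ]
    executable, model_name = HYPHEN_RUNNERS[policy]
    return [
        executable,
        "--root-dir", root_dir,
        "--checkpoint-path", world_model_ckpt,
        "--model-name", model_name,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for the Octo runner.
    print(shlex.join(build_runner_command("octo", "/path/to/tasks", "/path/to/world_model.pt")))
```

If the list is passed to subprocess.run without a shell, a leading ~ in a path is not expanded; os.path.expanduser can be applied first.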

Training quick start

The command below launches training. By default it trains on the tiny 10-example dataset in sample_data/.

# Replace N with the number of available GPUs
torchrun --nproc_per_node=N -m world_model_eval.train

Checkpoints and generated GIF samples will be written to outputs/<timestamp>/.

Train on Open X-Embodiment Datasets

To train on the Open X-Embodiment datasets we used in the paper:

# We'll need tensorflow datasets and tensorflow since this code is 
# based on the original Open X-Embodiment repo.
pip install tensorflow tensorflow_datasets
# For example, download just the RT-1 dataset:
python -m world_model_eval.download_data --dataset_name rt_1
# By default the data will be written to ./converted_datasets.
# To choose your own output directory:
python -m world_model_eval.download_data --dataset_name rt_1 --output_dir <your output dir>

See world_model_eval/download_data.py for more dataset names to choose from.

Then launch training with the correct dataset path:

torchrun --nproc_per_node=N -m world_model_eval.train --dataset_dir ./converted_datasets --subset_names rt_1
# Replace ./converted_datasets if your path is different.

Pass a comma-separated list to --subset_names to train on a mixture of datasets. For example, after downloading the rt_1 and bridge_v2 datasets, --subset_names rt_1,bridge_v2 trains on both RT-1 and Bridge V2.
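When the dataset mixture is chosen programmatically, the training command can be assembled from a list of names. This is a minimal sketch using only the flags shown above; build_train_command is a hypothetical helper, not something the package provides.

```python
import shlex


def build_train_command(num_gpus, dataset_dir, subset_names):
    """Return the torchrun argument list for training on a dataset mixture."""
    subsets = [name.strip() for name in subset_names if name.strip()]
    if not subsets:
        raise ValueError("at least one dataset name is required")
    return [
        "torchrun",
        f"--nproc_per_node={num_gpus}",
        "-m", "world_model_eval.train",
        "--dataset_dir", dataset_dir,
        # The flag takes a single comma-separated value, with no spaces.
        "--subset_names", ",".join(subsets),
    ]


print(shlex.join(build_train_command(2, "./converted_datasets", ["rt_1", "bridge_v2"])))
# prints: torchrun --nproc_per_node=2 -m world_model_eval.train --dataset_dir ./converted_datasets --subset_names rt_1,bridge_v2
```

Joining without spaces matters: a value such as "rt_1, bridge_v2" would be split by the shell into two separate arguments.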

Training on Bridge V2

Since Bridge V2 was not included in the original Open X-Embodiment dataset, you'll need to first download the TFDS dataset to your machine like this:

wget -r -np -R "index.html*" https://rail.eecs.berkeley.edu/datasets/bridge_release/data/tfds/bridge_dataset/

Then convert the dataset to our format with python -m world_model_eval.download_data --dataset_name bridge_v2, changing BRIDGE_V2_PATH at the top of the script if necessary. Because Bridge V2 is a superset of Bridge V1, download either bridge or bridge_v2, not both.

Citation

If you find this work useful, please cite:

@misc{quevedo2025worldgymworldmodelenvironment,
      title={WorldGym: World Model as An Environment for Policy Evaluation}, 
      author={Julian Quevedo and Ansh Kumar Sharma and Yixiang Sun and Varad Suryavanshi and Percy Liang and Sherry Yang},
      year={2025},
      eprint={2506.00613},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.00613}, 
}

Acknowledgements

Boyuan Chen and Kiwhan Song for Diffusion Forcing
DiT
Oasis
open_x_embodiment
