- May 19th, 2025: Our code is released!
- May 19th, 2025: WorldEval is out! The paper can be found here. The project page can be found here.
Our robot trajectory data format is the same as ACT, so you need to convert your data into the HDF5 (h5py) format.
```
# h5 data structure
root
├── action (100, 14)
├── language_raw (1,)
├── substep_reasonings (100,)
└── observations
    ├── images  # multi-view
    │   ├── cam_left_wrist (100, 480, 640, 3)
    │   ├── cam_right_wrist (100, 480, 640, 3)
    │   └── cam_high (100, 480, 640, 3)
    ├── joint_positions (100, 14)
    ├── qpos (100, 14)
    └── qvel (100, 14)
```
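The structure above can be produced with h5py directly. The following is a minimal sketch; the shapes (T=100 timesteps, 14-DoF actions, 480x640 RGB cameras) mirror the example tree and should be adapted to your own robot, and the placeholder zero arrays and instruction strings are illustrative only.

```python
import numpy as np
import h5py

T = 100  # number of timesteps in this episode (example value)

with h5py.File("episode_0.hdf5", "w") as f:
    # Top-level datasets
    f.create_dataset("action", data=np.zeros((T, 14), dtype=np.float32))
    f.create_dataset("language_raw", data=np.array([b"pick up the cube"]))
    f.create_dataset("substep_reasonings", data=np.array([b"reach"] * T))

    # Nested observation groups
    obs = f.create_group("observations")
    imgs = obs.create_group("images")
    for cam in ("cam_left_wrist", "cam_right_wrist", "cam_high"):
        imgs.create_dataset(cam, data=np.zeros((T, 480, 640, 3), dtype=np.uint8))
    obs.create_dataset("joint_positions", data=np.zeros((T, 14), dtype=np.float32))
    obs.create_dataset("qpos", data=np.zeros((T, 14), dtype=np.float32))
    obs.create_dataset("qvel", data=np.zeros((T, 14), dtype=np.float32))
```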
The weights of the VLA policies used in our paper are listed as follows:
| Model | Link |
|---|---|
| Pi0 | huggingface |
| DexVLA | huggingface |
| Diffusion Policy | huggingface |
Wan-Video is a collection of video synthesis models open-sourced by Alibaba.
| Developer | Name | Link | Scripts |
|---|---|---|---|
| Wan Team | 14B image-to-video 480P | Link | wan_14b_image_to_video.py |
| Wan Team | 14B image-to-video 720P | Link | wan_14b_image_to_video.py |
Please install WorldEval from source:
```shell
git clone https://github.com/liyaxuanliyaxuan/Worldeval.git
cd Worldeval
pip install -e .
```

Wan-Video supports multiple attention implementations. If you have installed any of the following, they will be enabled based on priority:
- Flash Attention 3
- Flash Attention 2
- Sage Attention
- torch SDPA (default; `torch>=2.5.0` is recommended)
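The priority order above can be checked with a small probe. This is a sketch, not the project's own dispatch code; the package names (`flash_attn_interface`, `flash_attn`, `sageattention`) are assumptions based on the common PyPI wheels, so verify them against your environment.

```python
import importlib.util

# Ordered from highest to lowest priority, mirroring the list above.
PRIORITY = [
    ("Flash Attention 3", "flash_attn_interface"),
    ("Flash Attention 2", "flash_attn"),
    ("Sage Attention", "sageattention"),
    ("torch SDPA", "torch"),  # default fallback; torch>=2.5.0 recommended
]

def selected_attention() -> str:
    """Return the name of the first installed attention backend."""
    for name, module in PRIORITY:
        if importlib.util.find_spec(module) is not None:
            return name
    return "torch SDPA"  # default even if nothing else is found

print(selected_attention())
```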
You need to organize the HDF5 files containing the robot trajectory data as follows:
```
data/example_dataset/
├── metadata.csv
└── train/  # empty directory to store processed tensors
```
metadata.csv:
```
file_path,file_name,text
/path/to/episode_0.hdf5,episode_0.hdf5,""
```
```shell
cd wanvideo
```

Run `scripts/data_process.sh`:
```shell
#!/bin/bash
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_wan_t2v_act_embed.py \
  --task data_process \
  --dataset_path data/dex2_example_dataset \
  --output_path ./models \
  --text_encoder_path "Wan2.1-I2V-14B-480P/models_t5_umt5-xxl-enc-bf16.pth" \
  --image_encoder_path "Wan2.1-I2V-14B-480P/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth" \
  --vae_path "Wan2.1-I2V-14B-480P/Wan2.1_VAE.pth" \
  --tiled \
  --num_frames 81 \
  --height 480 \
  --width 832 \
  --encode_mode dexvla \
  --action_encoded_path data/dex2_example_dataset/train/all_actions_dex.pt
```

After that, some cached files will be stored in the dataset folder.
```
data/example_dataset/
├── metadata.csv
└── train/
    ├── file1.hdf5.tensors.pth
    └── file2.hdf5.tensors.pth
```
Run `scripts/train.bash`:
```shell
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
python train_wan_t2v.py \
  --task train \
  --train_architecture lora \
  --dataset_path data/example_dataset \
  --output_path ./models \
  --dit_path "Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00001-of-00007.safetensors,Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00002-of-00007.safetensors,Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00003-of-00007.safetensors,Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00004-of-00007.safetensors,Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00005-of-00007.safetensors,Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00006-of-00007.safetensors,Wan2.1-I2V-14B-480P/diffusion_pytorch_model-00007-of-00007.safetensors" \
  --image_encoder_path "Wan2.1-I2V-14B-480P/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth" \
  --steps_per_epoch 500 \
  --max_epochs 40 \
  --learning_rate 1e-4 \
  --lora_rank 16 \
  --lora_alpha 16 \
  --lora_target_modules "q,k,v,o,ffn.0,ffn.2,action_alpha,action_proj.0,action_proj.2" \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing
```
Please separate the safetensors files with commas. For example: `models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00001-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00002-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00003-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00004-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00005-of-00006.safetensors,models/Wan-AI/Wan2.1-T2V-14B/diffusion_pytorch_model-00006-of-00006.safetensors`.
Extract action embeddings using different VLA policies, prepare the encoded actions, and save them in a .pt file with the following structure:
```python
{
    "file_path": ["path/to/file1.hdf5", "path/to/file2.hdf5"],
    "encoded_action": [latent_action_vector1, latent_action_vector2]
}
```
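Producing that file amounts to pairing each episode path with its latent action and saving the dictionary with `torch.save`. A minimal sketch, where `encode_episode` is a placeholder for whatever VLA policy encoder you use, and the 1280-dimensional dummy latent matches the `--action_dim 1280` used at inference:

```python
import torch

def build_payload(episode_paths, encode_episode):
    """Pair each episode with its latent action in the expected structure."""
    return {
        "file_path": list(episode_paths),
        "encoded_action": [encode_episode(p) for p in episode_paths],
    }

# Example with a dummy encoder; replace the lambda with a real policy encoder.
payload = build_payload(
    ["path/to/file1.hdf5", "path/to/file2.hdf5"],
    lambda path: torch.zeros(1280),
)
torch.save(payload, "all_actions_dex.pt")
```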
Repositories for some VLA policies can be found below:
Use `utils/sample_frames_from_dir_for_test` to extract sample frames from the HDF5 files for testing; this will generate a metadata.json file and save the first frame for use in generation.
```shell
cd wanvideo
```

Run `scripts/inference.bash`:
```shell
# --meta_path: path of the metadata.json generated in step 2
# --action_encoded_path: path of the .pt file generated in step 1
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python infer.py \
  --lora_path "" \
  --meta_path "" \
  --output_subdir "lora_act_alpha_0.3_dex_ep30" \
  --action \
  --action_alpha 0.3 \
  --action_dim 1280 \
  --action_encoded_path ""
```

We build our project based on:
- WAN2.1: a comprehensive and open suite of video foundation models that pushes the boundaries of video generation.
- DiffSynth Studio: an open-source project aimed at exploring innovations in AIGC technology, licensed under the Apache License 2.0.
Significant modifications have been made by WorldEval, including:
- Processing robot data
- Training with latent actions
- Inference with latent actions
This project contains code licensed under:
- Apache License 2.0 (from the original project)
- MIT License (for modifications made by Yaxuan Li)
If you find WorldEval useful for your research and applications, please cite it using this BibTeX:

```bibtex
@misc{li2025worldevalworldmodelrealworld,
      title={WorldEval: World Model as Real-World Robot Policies Evaluator},
      author={Yaxuan Li and Yichen Zhu and Junjie Wen and Chaomin Shen and Yi Xu},
      year={2025},
      eprint={2505.19017},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2505.19017},
}
```