SimX-OR: Extending Any Simulation Benchmark to Evaluate the Observational Robustness of VLA Models

A plug-and-play evaluation framework designed to systematically measure the observational robustness of Vision-Language-Action (VLA) models.
It provides a lightweight extension to existing simulation benchmarks without modifying the simulator environments, enabling fast quantitative evaluation of VLA robustness without additional training.




🌟 Key Features

🔹 Plug-and-Play Compatibility (Any Benchmarks)

  • Works directly on rendered images, no simulator code changes required.
  • Instantly compatible with any simulator for robustness evaluation.

🔹 Flexible Visual Disturbances (Any Tasks)

  • Easily customize blur, noise, occlusion, brightness shifts, and other disturbances.
  • Built on standard libraries like OpenCV and Pillow.

🔹 Reproducible and Fair Evaluation (Any Models)

  • Each trial maps to a deterministic random seed, ensuring consistent testing conditions across models.
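As a minimal sketch of what deterministic per-trial seeding can look like (the function name and seeding scheme here are illustrative, not SimX-OR's actual API):

```python
import numpy as np

def trial_rng(task_id: int, episode_id: int, base_seed: int = 0) -> np.random.Generator:
    # SeedSequence mixes the identifiers into a well-distributed seed,
    # so every model sees identical disturbance draws for the same trial.
    ss = np.random.SeedSequence([base_seed, task_id, episode_id])
    return np.random.default_rng(ss)

# Same (task, episode) pair -> same disturbance sequence, across runs and models.
a = trial_rng(3, 7).normal(size=4)
b = trial_rng(3, 7).normal(size=4)
assert np.allclose(a, b)
```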

🔥 Main Components

Supported Simulation Benchmarks: SimplerEnv (WidowX and Google Robot) and LIBERO.

Various Observational Disturbances

Both temporal and spatial disturbance settings are included. In the temporal dimension, disturbances are applied with varying frequencies (1:0, 1:1, 1:3, etc.), simulating different levels of temporal consistency. In the spatial dimension, we incorporate multiple categories of visual disturbances:
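As a sketch of how a temporal ratio might be applied per frame (the helper below and the disturbed:clean reading of the ratios are our assumptions; see the paper for the exact semantics):

```python
def apply_every(frames, disturb, ratio=(1, 3)):
    # ratio=(1, 3): 1 disturbed frame followed by 3 clean ones;
    # ratio=(1, 0): every frame disturbed.
    d, c = ratio
    period = d + c
    return [disturb(f) if i % period < d else f for i, f in enumerate(frames)]

# With ratio (1, 3), frames at indices 0 and 4 are disturbed:
out = apply_every(list(range(1, 9)), lambda f: -f, ratio=(1, 3))
# out == [-1, 2, 3, 4, -5, 6, 7, 8]
```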

  • Blurring
    Simulates defocus or motion blur effects by reducing image sharpness.

  • Jittering
    Emulates camera shake or motion-induced streaking across consecutive frames.

  • Frame Dropping
    Mimics temporal desynchronization by repeating frames.

  • Full Occlusion
    Simulates a complete blockage of the camera view, such as by an obstacle.

  • Overexposing
    Represents excessive brightness or lighting interference in specific regions of the frame.

  • Partial Occlusions
    Introduces localized visual blockages, such as a hand or tool partially covering the camera.

  • Gaussian Noise
    Adds continuous noise patterns resembling sensor or environmental interference.

  • Impulse Noise
    Applies salt-and-pepper style pixel corruption to mimic transmission errors or sensor faults.

  • Resolution Reduction
    Simulates low-quality or bandwidth-limited video feeds by degrading spatial resolution.

  • Camera Rotation
    Emulates camera angle perturbations or physical shaking, altering the visual perspective.

  • Object Distraction
    Introduces unrelated moving or static objects that distract from the primary scene or target.

  • Viewpoint Shift
    Simulates changes in camera position or angle that modify the perceived spatial relationships in the scene.

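To illustrate how purely image-level these disturbances are, here is a minimal salt-and-pepper (impulse noise) implementation with NumPy; the function name and parameters are ours, not necessarily those used in SimX-OR:

```python
import numpy as np

def impulse_noise(img: np.ndarray, amount: float = 0.05, rng=None) -> np.ndarray:
    # Corrupt a fraction `amount` of pixels with salt (255) or pepper (0).
    rng = np.random.default_rng(0) if rng is None else rng
    out = img.copy()
    hit = rng.random(img.shape[:2]) < amount   # which pixels to corrupt
    salt = rng.random(img.shape[:2]) < 0.5     # salt vs. pepper split
    out[hit & salt] = 255
    out[hit & ~salt] = 0
    return out

img = np.full((64, 64, 3), 128, dtype=np.uint8)
noisy = impulse_noise(img, amount=0.1)
```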

🏆 SimplerEnv-OR (WidowX)

🏅 Quantitative Results


🔥 Baselines Table:

Model and server code is partially provided in policies and models.

| Model Name | CodeBase | Installation | Checkpoint |
| --- | --- | --- | --- |
| CronusVLA | github | cronusvla | https://huggingface.co/JeasLee/cronusvla_7B_bridge_rt_1 |
| CogACT | github | cogact | https://huggingface.co/CogACT/CogACT-Base |
| SpatialVLA | github | transformers == 4.47.0 | https://huggingface.co/IPEC-COMMUNITY/spatialvla-4b-mix-224-pt |
| RoboVLMs | github | robovlms | https://huggingface.co/robovlms/RoboVLMs |
| TraceVLA | github | tracevla | https://huggingface.co/furonghuang-lab/tracevla_7b |
| π0 (JAX) | github | pi-0 (jax) | https://huggingface.co/HaomingSong/openpi0-bridge-lora |
| π0 (lerobot) | github | lerobot-pi | https://huggingface.co/HaomingSong/lerobot-pi0-bridge |
| GR00T | github | Isaac-GR00T | https://huggingface.co/ShuaiYang03/GR00T-N1.5-Lerobot-SimplerEnv-BridgeV2 |

Results in our paper are already provided in ./outputs/SimplerEnv-OR.

💪 Start testing

Below, we provide step-by-step examples for environment installation, evaluating CronusVLA, evaluating other baselines, and evaluating your own model:

1️⃣ SimplerEnv Environment Installation

The SimplerEnv experiments are located in ./experiments/SimplerEnv (for models without state input) and ./experiments/SimplerEnv_w_state (for models with state input):

  1. Clone the ManiSkill2_real2sim repository under ./experiments/SimplerEnv. If you want to evaluate π0 (JAX)/π0-fast/Lerobot-π0/GR00T, please clone the modified version https://github.com/allenzren/ManiSkill2_real2sim under ./experiments/SimplerEnv_w_state.

  2. Follow the respective README.md files to install both SimplerEnv and ManiSkill2.

conda create -n simpler_env python=3.10
conda activate simpler_env
pip install numpy==1.24.4
cd ./experiments/SimplerEnv
# install Maniskill2
git clone https://github.com/simpler-env/ManiSkill2_real2sim
cd ManiSkill2_real2sim
pip install -e .
# install SimplerEnv
cd ..
pip install -e .
  3. SimplerEnv also requires Vulkan runtime libraries:
# Install Vulkan runtime libraries and tools
conda install conda-forge::libvulkan-loader
2️⃣ Fast Testing

We use CronusVLA as an example to demonstrate how to evaluate models within SimplerEnv-OR.

(1) Environment Setup

Create and activate a dedicated conda environment, then install all dependencies:

conda create --name cronusvla_simpler_env --clone simpler_env
conda activate cronusvla_simpler_env
cd ./models/CronusVLA

# Install PyTorch (with CUDA 12.1 support)
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121

# Install additional dependencies
pip install transformers==4.40.1 accelerate==1.2.1 peft==0.11.1
pip install numpy==1.26.4

# =>> If you run into difficulty, try `pip cache remove flash_attn` first
pip cache remove flash_attn
pip install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"

pip install flash-attn==2.5.5 --no-build-isolation
# or pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.5/flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# Install project dependencies
pip install -r requirements.txt

(2) Download Pretrained Checkpoint from Hugging Face

Download the pretrained weights to the local ./outputs directory:

# Download the entire model repository to ./outputs
huggingface-cli download JeasLee/cronusvla_7B_bridge_rt_1 --local-dir ./outputs --local-dir-use-symlinks False

(3) Run Evaluation on an 8-GPU Node

Execute the evaluation script OR_widowx_evaluate.sh:

#!/bin/bash
# get current directory
ROOT_DIR=$(pwd)
# add PYTHONPATH
export PYTHONPATH="$PYTHONPATH:$ROOT_DIR/models/CronusVLA"
echo "PYTHONPATH set to: $PYTHONPATH"

policy_model=cronusvla

NUM=1
bash ./SimplerEnv-OR/meta_test_SimplerEnv/simplerenv_widowx_OR.sh $NUM 0 outputs/cronusvla_7B_bridge_rt_1/checkpoints/step-055000-epoch-04-loss=0.0286.pt ${policy_model} & 
pid1=$!
wait $pid1

⚠️ Make sure to update the correct checkpoint path before running.

Running this on an 8-GPU node ensures faster evaluation. If you wish to test on a single GPU, use OR_widowx_evaluate_one_GPU.sh, and update both the GPU ID and checkpoint path accordingly. Note that single-GPU evaluation will take significantly longer.

(4) Summarize Results

Compute the final evaluation scores using OR_widowx_summary_result.sh:

#!/bin/bash
LogPath=./outputs/SimplerEnv-OR/CronusVLA.log 
python ./experiments/SimplerEnv/simpler_env/calculate_all_OR_table.py --root outputs/cronusvla_7B_bridge_rt_1/results_step-055000-epoch-04-loss=0.0286/step-055000-epoch-04-loss=0.0286.pt/OR --score 60.4 > ${LogPath}

You can specify:

  • LogPath: the location where the final log file is stored.
  • --root: the path containing OR evaluation results.
  • --score: the WidowX score from the original SimplerEnv benchmark.

We also provide the original SimplerEnv WidowX evaluation script widowx_evaluate_origin.sh.

(5) Kill Background Evaluation Processes

Since the server is launched in the background, a simple CTRL+C cannot terminate the process. To kill the evaluations, run:

bash ./SimplerEnv-OR/scripts_CronusVLA/kill_unfinished.sh

⚠️ Note: This script will terminate all running evaluations on the node.

3️⃣ Run Other Baselines

For baseline models, we primarily provide examples on SimplerEnv-OR (WidowX). If you want to reproduce or test specific baselines, refer to the detailed table below for configuration details. We provide partial baseline inference code in ./models and several server scripts in policies or policies (with state).

Following the same procedure as Fast Testing, for a new baseline you need to:

  1. Clone the environment: Duplicate the simpler_env conda environment to a new one (e.g., simpler_env_xxx) to ensure environment isolation.
  2. Pull the codebase: Download the corresponding model codebase or its main components (see the CodeBase column in the baselines table) into ./models; some are already provided.
  3. Install dependencies: Follow the installation instructions for each baseline (see the Installation column) to set up the required dependencies.
  4. Download checkpoints: Download and place the corresponding pretrained checkpoints under ./outputs (see the Checkpoint column).
  5. Run or modify scripts: Refer to the examples in ./SimplerEnv-OR/scripts_CronusVLA and adjust commands as needed. Be sure to update the codebase path:
# import codebase path
export PYTHONPATH="$PYTHONPATH:$ROOT_DIR/models/CronusVLA"

⚠️ Notice: RoboVLMs is not supported in policies; see and run the scripts in mod_RoboVLMs instead.

4️⃣ 🧩 Customizing for Your Own Models

Compared to the original SimplerEnv, our main modifications are located in maniskill2_evaluator.py and argparse.py of the SimplerEnv evaluation code. Model execution is handled via main_inference_client.py and main_inference_server.py in simpler_env. ./experiments/SimplerEnv_w_state supports the additional state input.

To integrate your own model, expose it behind the same client/server interface: implement a server analogous to main_inference_server.py that loads your checkpoint and returns an action for each observation, then point the evaluation client at it.
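A minimal sketch of the interface such a wrapper would need to expose (class and method names below are hypothetical, not the framework's actual API):

```python
import numpy as np

class MyPolicy:
    # Hypothetical wrapper: the evaluator feeds rendered frames and an
    # instruction, and expects a fixed-length action vector back.

    def reset(self, instruction: str) -> None:
        # Clear any per-episode state (e.g. frame history for multi-frame models).
        self.instruction = instruction

    def step(self, image: np.ndarray) -> np.ndarray:
        # Replace this stub with your model's forward pass.
        assert image.ndim == 3  # H x W x 3 rendered frame
        return np.zeros(7)      # e.g. 6-DoF end-effector delta + gripper

policy = MyPolicy()
policy.reset("put the spoon on the towel")
action = policy.step(np.zeros((224, 224, 3), dtype=np.uint8))
```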

5️⃣ 🧩 Customizing for Your Own Disturbance Types

You can add more disturbance types by modifying the configuration in the function apply_image_perturbation(), referring to existing implementations such as motion_blur() for guidance when defining new functions.
After that, make sure to include your newly added disturbance types in OR_cronusvla_bridge.sh.
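As a hedged sketch of what a new disturbance function could look like (the registry dict and names below are illustrative; match them to the actual apply_image_perturbation() configuration):

```python
import numpy as np

def brightness_shift(img: np.ndarray, delta: int = 60) -> np.ndarray:
    # New disturbance: global brightness offset, clipped to the uint8 range.
    return np.clip(img.astype(np.int16) + delta, 0, 255).astype(np.uint8)

# Illustrative registry: in SimX-OR, hook the new function into
# apply_image_perturbation() and add its name to OR_cronusvla_bridge.sh.
DISTURBANCES = {"brightness_shift": brightness_shift}

img = np.full((4, 4, 3), 220, dtype=np.uint8)
out = DISTURBANCES["brightness_shift"](img)  # 220 + 60 clips to 255
```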

6️⃣ Visualization

For more experimental results, please refer to Appendix A of the original paper or our homepage.

🏆 SimplerEnv-OR (Google Robot)

SimplerEnv-OR (Google Robot) and SimplerEnv-OR (WidowX) share most of their codebase and dependencies. You can directly follow the instructions provided for WidowX. The scripts to execute are OR_google_robot_evaluate_one_GPU.sh and OR_google_robot_summary_result.sh.


It’s worth noting that the Google Robot benchmark already includes variant aggregation tests. Our SimX-OR focuses on extending the visual matching setting, which has minimal overlap with those tests. Given that there are nearly 24 types of visual disturbances and the evaluation process can be time-consuming, we only provide scripts for the Pick Coke Can and Move Near tasks to enable quick validation.

Our model still supports seamless extensibility — you can easily add new tasks by placing corresponding scripts under ./experiments/SimplerEnv/scripts_self/server_speed_scripts_OR.

🏆 LIBERO-OR

We extend our framework to support LIBERO, with the corresponding implementation located in ./experiments/Libero. Below, we provide a step-by-step example using CronusVLA.

1️⃣ LIBERO Environment Installation
conda create --name libero python=3.10
conda activate libero

Then, clone and install the LIBERO repo:

git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git libero
cd libero
pip install -e .

Install additional requirements:

cd deploy/libero
pip install -r libero_requirements.txt
2️⃣ CronusVLA Installation and Downloading

First, clone and activate the conda environment:

conda create --name cronusvla_libero --clone libero
conda activate cronusvla_libero

Then, install all required dependencies following the CronusVLA Environment Setup. All necessary dependencies for CronusVLA are already prepared under ./models/CronusVLA (for other models, please prepare the corresponding codebase). Pretrained CronusVLA weights for LIBERO can be downloaded from the CronusVLA repository.

3️⃣ Run Evaluation

We support all LIBERO suites, including LIBERO-spatial, LIBERO-object, LIBERO-goal, and LIBERO-10. In addition to the standard instruction text and third-person image, we also support settings with or without wrist-view images. Notably, both the third-person and wrist-view images at each timestep receive the same type and the same level of visual perturbation.
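One simple way to guarantee that both views receive identical perturbation draws at each timestep is to derive the RNG from the timestep; a sketch under that assumption (function names are ours, not the framework's):

```python
import numpy as np

def gaussian_noise(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Example perturbation: additive Gaussian noise, clipped back to uint8.
    noise = rng.normal(0, 10, img.shape)
    return np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)

def perturb_both(third_person: np.ndarray, wrist: np.ndarray, t: int, seed: int = 0):
    # Re-seeding with the same (seed, t) at each timestep guarantees both
    # views draw the identical noise pattern (assuming equal image shapes).
    a = gaussian_noise(third_person, np.random.default_rng((seed, t)))
    b = gaussian_noise(wrist, np.random.default_rng((seed, t)))
    return a, b

x = np.full((32, 32, 3), 100, dtype=np.uint8)
a, b = perturb_both(x, x.copy(), t=5)
assert (a == b).all()  # identical inputs receive identical perturbations
```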

Example scripts are provided in ./LIBERO-OR/scripts_CronusVLA. For instance, to evaluate LIBERO-goal, execute the script libero_OR_goal.sh:

#!/bin/bash
# get current directory
ROOT_DIR=$(pwd)
# add PYTHONPATH
export PYTHONPATH="$PYTHONPATH:$ROOT_DIR/models/CronusVLA"
echo "PYTHONPATH set to: $PYTHONPATH"

CUDA_DEVICE=0  # can be modified according to demand
# pre-trained model storage directory
CHECKPOINT=path/to/ckpt.pt
model_family=cronus
task_suite_name=libero_goal
use_wrist_image=False
bash ./LIBERO-OR/meta_test_LIBERO/libero_OR.sh ${CUDA_DEVICE} ${CHECKPOINT} ${model_family} ${task_suite_name} ${use_wrist_image}

You can specify:

  • CUDA_DEVICE: the GPU to use
  • CHECKPOINT: the path to your pretrained .pt file
  • model_family: the model type
  • task_suite_name: the LIBERO task suite
  • use_wrist_image: whether to use wrist-view images

This script runs on a single GPU and may take some time to complete. For a quick verification, we recommend starting with the LIBERO-spatial suite (more lightweight).
4️⃣ Summarize Results

All evaluation logs are saved in ./experiments/Libero/logs/, and rollout videos are stored in ./experiments/Libero/rollouts/.


🗒️ TODO

The code is still in development. If you encounter any issues or bugs, please feel free to let us know.

We will continue to update experimental results and add new baselines; more simulation benchmarks will also be included. Please stay tuned!


📌 Q&A

What’s the Motivation Behind This Project?
  • Current benchmarks rarely evaluate robustness under visual disturbances, even though it’s crucial for real-world deployment.
  • We wanted to design a lightweight robustness evaluation tool that introduces minimal additional effort.
  • SimX-OR serves as a reference framework, not an absolute benchmark — we encourage you to adapt it to your own environments and baselines for fair and consistent comparisons.
  • The framework is designed to be simple, flexible, and fully customizable.
  • We will update the code periodically while ensuring backward compatibility across versions.
How SimX-OR Differs from Other Benchmark Variants

Existing benchmark variants often require modifying the simulation environment to introduce disturbances. SimX-OR, in contrast:

  1. Is fully decoupled from the simulator — no changes to the simulation logic are needed.
  2. Can be applied to most existing benchmarks with minimal effort.
  3. Works seamlessly in both simulation and real-world settings.
  4. Introduces observational disturbances not only by type, but also along the temporal dimension, enabling evaluation under varying disturbance frequencies.
  5. Provides a limited yet practical set of disturbance types, operating purely at the image level without altering the underlying simulation design.
Contributing

If you have questions or suggestions for improvement, feel free to open an issue or reach out via email — contributions and feedback are always welcome.

🔗 Citation

@article{li2025cronusvla,
  title={CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation},
  author={Li, Hao and Yang, Shuai and Chen, Yilun and Tian, Yang and Yang, Xiaoda and Chen, Xinyi and Wang, Hanqing and Wang, Tai and Zhao, Feng and Lin, Dahua and others},
  journal={arXiv preprint arXiv:2506.19816},
  year={2025}
}

@article{yang2025instructvla,
  title={Instructvla: Vision-language-action instruction tuning from understanding to manipulation},
  author={Yang, Shuai and Li, Hao and Chen, Yilun and Wang, Bin and Tian, Yang and Wang, Tai and Wang, Hanqing and Zhao, Feng and Liao, Yiyi and Pang, Jiangmiao},
  journal={arXiv preprint arXiv:2507.17520},
  year={2025}
}

👏 Acknowledgment

This project builds in part on OpenVLA, SimplerEnv, LIBERO, CogACT, and SimplerEnv-OpenVLA. Thanks for their open-source contributions!
