This repository provides the official inference code for the embodied world model in VISTA.
git clone https://github.com/vista-wm/Vista-WM
cd Vista-WM
conda create -n vista python=3.11 -y
conda activate vista
Install required Python packages via pip:
pip install -r requirements.txt
Before running inference, download the following model weights and place them in the specified paths:
Download the model checkpoint from Hugging Face and place it in inference/ckpt:
# Install huggingface-hub if not installed
pip install huggingface-hub
cd inference
# Download model weights
huggingface-cli download vista-wm/vista-wm-ckpt --repo-type model --local-dir ./ckpt
Download the IBQTokenizer weights and place them in inference/IBQTokenizer:
huggingface-cli download vista-wm/IBQTokenizer --repo-type model --local-dir ./IBQTokenizer
python app.py
The model expects a structured prompt format to enable subtask decomposition and goal image generation.
Robot Arm Type: {robot arm type}. Instruction: {task instruction}. Finish the task with {n} steps.
Supported robot arm types:
- Songling Aloha
- Songling Aloha Multi View
- Widow X
- Google Everyday
- AgiBot Dual-Arm
- xArm
Example prompt:
Robot Arm Type: Songling Aloha Multi View. Instruction: put the apple on the plate. Finish the task with 2 steps.
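
When prompts are generated programmatically, a small helper can keep them aligned with the template and the supported arm types listed above. The following is a minimal sketch, not part of this repository: the names `SUPPORTED_ARM_TYPES` and `build_prompt` are hypothetical, and the validation rules are assumptions based only on the format described here.

```python
# Hypothetical helper (not part of the repository): builds a prompt string
# that follows the structured format described in this README.

# Arm type names copied from the "Supported robot arm types" list.
SUPPORTED_ARM_TYPES = (
    "Songling Aloha",
    "Songling Aloha Multi View",
    "Widow X",
    "Google Everyday",
    "AgiBot Dual-Arm",
    "xArm",
)


def build_prompt(arm_type: str, instruction: str, n_steps: int) -> str:
    """Return a prompt in the README's template format."""
    if arm_type not in SUPPORTED_ARM_TYPES:
        raise ValueError(f"Unsupported robot arm type: {arm_type!r}")
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    # Drop surrounding whitespace and a trailing period so the template
    # does not end up with a doubled period after the instruction.
    instruction = instruction.strip().rstrip(".")
    if not instruction:
        raise ValueError("instruction must not be empty")
    # The template uses the literal word "steps" regardless of the count.
    return (
        f"Robot Arm Type: {arm_type}. "
        f"Instruction: {instruction}. "
        f"Finish the task with {n_steps} steps."
    )


if __name__ == "__main__":
    print(build_prompt("Songling Aloha Multi View", "put the apple on the plate", 2))
```

Called with the arguments shown, the helper reproduces the example prompt above; the resulting string can then be pasted into the app's prompt field.
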
If the automatically generated subtask plan is suboptimal, the system supports manual subtask specification.
You may directly input subtasks in the "Manual subtask input" field.
Manual subtask input format:
Step 1: pick the apple with the right arm. Step 2: place the apple on the plate using the right hand.
This allows users to refine the model’s hierarchical plan and obtain the desired goal images.
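
If subtasks come from a script or another planner, the manual input string can be assembled from a list. This is a minimal sketch under the assumption that the field accepts exactly the "Step i: ..." format shown above; `format_subtasks` is a hypothetical name and is not provided by the repository.

```python
# Hypothetical helper (not part of the repository): joins a list of subtask
# descriptions into the "Step 1: ... Step 2: ..." manual input format.

def format_subtasks(subtasks: list[str]) -> str:
    """Return the subtasks numbered from 1, each terminated by a period."""
    # Skip blank entries and strip any trailing period before re-adding one.
    cleaned = [s.strip().rstrip(".") for s in subtasks if s.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty subtask is required")
    return " ".join(
        f"Step {i}: {text}." for i, text in enumerate(cleaned, start=1)
    )


if __name__ == "__main__":
    print(format_subtasks([
        "pick the apple with the right arm",
        "place the apple on the plate using the right hand",
    ]))
```

The output can be pasted into the "Manual subtask input" field. Keeping the number of subtasks equal to the step count in the prompt is a reasonable precaution, although this README does not state that it is required.
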
When using multi-view settings (e.g., Songling Aloha Multi View), the initial observation images must be uploaded in the following fixed order:
- Head camera
- Left wrist camera
- Right wrist camera
Maintaining this order is required for correct multi-view conditioning of the world model.
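
Because the upload order is fixed, it can help to sort images by camera name before uploading them. The sketch below assumes the images are held in a dictionary keyed by camera; the keys `head`, `left_wrist`, and `right_wrist` and the function `order_views` are hypothetical names chosen for illustration, not identifiers from the repository.

```python
# Hypothetical helper (not part of the repository): arranges multi-view
# observations in the fixed order head, left wrist, right wrist.

# Assumed dictionary keys, listed in the required upload order.
MULTI_VIEW_ORDER = ("head", "left_wrist", "right_wrist")


def order_views(images: dict) -> list:
    """Return the values of `images` in the required camera order."""
    missing = [name for name in MULTI_VIEW_ORDER if name not in images]
    if missing:
        raise ValueError(f"Missing camera views: {missing}")
    unexpected = [name for name in images if name not in MULTI_VIEW_ORDER]
    if unexpected:
        raise ValueError(f"Unexpected camera views: {unexpected}")
    return [images[name] for name in MULTI_VIEW_ORDER]


if __name__ == "__main__":
    # File names here are placeholders.
    views = {
        "right_wrist": "right.png",
        "head": "head.png",
        "left_wrist": "left.png",
    }
    print(order_views(views))
```

Failing on a missing or unexpected view, rather than silently reordering a partial set, avoids conditioning the model on a mislabelled camera.
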
