Scaling World Model for Hierarchical Manipulation Policies

🧠 Overview

This repository provides the official inference code for the embodied world model in VISTA.

📦 Installation

1. Clone the Repository

git clone https://github.com/vista-wm/Vista-WM
cd Vista-WM

2. Create a Virtual Environment

conda create -n vista python=3.11 -y
conda activate vista

3. Install Dependencies

Install required Python packages via pip:

pip install -r requirements.txt

🤖 Model Weights

Before running inference, download the following model weights and place them in the paths given in each step:

Step 1: Download VISTA World Model Checkpoint

Download the model checkpoint from Hugging Face and place it in inference/ckpt:

# Install huggingface-hub if not installed
pip install huggingface-hub

cd inference
# Download model weights
huggingface-cli download vista-wm/vista-wm-ckpt --repo-type model --local-dir ./ckpt

Step 2: Download IBQTokenizer Weights

Download the IBQTokenizer weights and place them in inference/IBQTokenizer:

huggingface-cli download vista-wm/IBQTokenizer --repo-type model --local-dir ./IBQTokenizer
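After both downloads, it can help to confirm that the weights landed where the steps above put them before launching the demo. The following is a minimal sketch, not part of the repository: the helper name `missing_weight_dirs` is hypothetical, and since the file names inside each directory are not documented here, it only checks that `ckpt` and `IBQTokenizer` exist and are non-empty.

```python
from pathlib import Path

# Directory names come from the two download steps above; both live under inference/.
EXPECTED_WEIGHT_DIRS = ("ckpt", "IBQTokenizer")


def missing_weight_dirs(inference_dir):
    """Return the expected weight directories that are absent or empty.

    Hypothetical helper: it does not validate the files themselves,
    because their names are not specified in this README.
    """
    root = Path(inference_dir)
    missing = []
    for name in EXPECTED_WEIGHT_DIRS:
        path = root / name
        # Count a directory as present only if it exists and has at least one entry.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing
```

Calling `missing_weight_dirs(".")` from inside `inference` returns an empty list when both directories are populated, and otherwise lists the ones that still need downloading.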

🚀 Launch Gradio Demo

python app.py

📝 Prompt Format

The model expects a structured prompt format to enable subtask decomposition and goal image generation.

Standard Template

Robot Arm Type: {robot arm type}. Instruction: {task instruction}. Finish the task with {n} steps.

Supported robot arm types:

Songling Aloha
Songling Aloha Multi View
Widow X
Google Everyday
AgiBot Dual-Arm
xArm

Example prompt:

Robot Arm Type: Songling Aloha Multi View. Instruction: put the apple on the plate. Finish the task with 2 steps.
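The template can also be filled in programmatically, which avoids typos in the robot arm type. This is a minimal sketch under stated assumptions: `build_prompt` and `SUPPORTED_ARM_TYPES` are hypothetical names that do not come from the repository, and rejecting unlisted arm types is a choice made for this example rather than documented model behavior.

```python
# Arm types copied from the supported list above.
SUPPORTED_ARM_TYPES = (
    "Songling Aloha",
    "Songling Aloha Multi View",
    "Widow X",
    "Google Everyday",
    "AgiBot Dual-Arm",
    "xArm",
)


def build_prompt(arm_type, instruction, num_steps):
    """Fill the standard template with an arm type, instruction, and step count."""
    if arm_type not in SUPPORTED_ARM_TYPES:
        raise ValueError(f"Unsupported robot arm type: {arm_type!r}")
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    # Drop surrounding whitespace and a trailing period so the template's own
    # period is not doubled.
    instruction = instruction.strip().rstrip(".")
    # The template literally says "{n} steps", so the plural is kept even for n = 1.
    return (
        f"Robot Arm Type: {arm_type}. "
        f"Instruction: {instruction}. "
        f"Finish the task with {num_steps} steps."
    )
```

For instance, `build_prompt("Songling Aloha Multi View", "put the apple on the plate", 2)` reproduces the example prompt above.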

Manual Intervention in Subtask Planning

If the automatically generated subtask plan is suboptimal, the system supports manual subtask specification.

You may directly input subtasks in the "Manual subtask input" field.

Manual subtask input format:

Step 1: pick the apple with the right arm. Step 2: place the apple on the plate using the right hand.

This lets users refine the model’s hierarchical plan to obtain the desired goal images.
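The manual subtask format shown above can be produced from a plain list of subtask descriptions. The sketch below is illustrative only: `format_subtasks` is a hypothetical helper, and the result is meant to be pasted into the "Manual subtask input" field rather than passed to any documented API.

```python
def format_subtasks(subtasks):
    """Join subtask descriptions into the 'Step k: ...' manual input format."""
    parts = []
    for index, subtask in enumerate(subtasks, start=1):
        # Normalize whitespace and trailing periods so each step ends with exactly one period.
        text = subtask.strip().rstrip(".")
        if not text:
            raise ValueError(f"Subtask {index} is empty")
        parts.append(f"Step {index}: {text}.")
    return " ".join(parts)
```

Passing `["pick the apple with the right arm", "place the apple on the plate using the right hand"]` yields the same string as the manual input example above.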

Multi-View Image Upload Convention

When using multi-view settings (e.g., Songling Aloha Multi View), the initial observation images must be uploaded in the following fixed order:

Head camera
Left wrist camera
Right wrist camera

Maintaining this order is required for correct multi-view conditioning of the world model.
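When images are collected per camera, a small helper can enforce this order before upload. This is a sketch under assumptions: the keys `head`, `left_wrist`, and `right_wrist` and the function `order_views` are hypothetical labels chosen for this example, and the values can be file paths or any image objects.

```python
# Required upload order from the convention above; the key names are illustrative.
VIEW_ORDER = ("head", "left_wrist", "right_wrist")


def order_views(images_by_view):
    """Return the images as a list in head, left wrist, right wrist order."""
    missing = [view for view in VIEW_ORDER if view not in images_by_view]
    if missing:
        raise ValueError(f"Missing camera views: {', '.join(missing)}")
    extra = [view for view in images_by_view if view not in VIEW_ORDER]
    if extra:
        raise ValueError(f"Unexpected camera views: {', '.join(extra)}")
    # Ignore the dictionary's own ordering and always emit the fixed order.
    return [images_by_view[view] for view in VIEW_ORDER]
```

For example, `order_views({"right_wrist": "r.png", "head": "h.png", "left_wrist": "l.png"})` returns the three paths with the head image first and the right wrist image last, regardless of how the dictionary was built.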
