
# Scaling World Model for Hierarchical Manipulation Policies

Paper | HuggingFace Model | Project Page


## 🧠 Overview

This repository provides the official inference code for the embodied world model in VISTA.


## 📦 Installation

### 1. Clone the Repository

```shell
git clone https://github.com/vista-wm/Vista-WM
cd Vista-WM
```

### 2. Create a Virtual Environment

```shell
conda create -n vista python=3.11 -y
conda activate vista
```

### 3. Install Dependencies

Install the required Python packages with pip:

```shell
pip install -r requirements.txt
```

## 🤖 Model Weights

Before running inference, download the following model weights and place them in the specified paths.

### Step 1: Download the VISTA World Model Checkpoint

Download the model checkpoint from Hugging Face and place it in `inference/ckpt`:

```shell
# Install huggingface-hub if it is not already installed
pip install huggingface-hub

cd inference
# Download the model weights
huggingface-cli download vista-wm/vista-wm-ckpt --repo-type model --local-dir ./ckpt
```

### Step 2: Download the IBQTokenizer Weights

Download the IBQTokenizer weights and place them in `inference/IBQTokenizer` (run this from the `inference` directory):

```shell
huggingface-cli download vista-wm/IBQTokenizer --repo-type model --local-dir ./IBQTokenizer
```

## 🚀 Launch Gradio Demo

```shell
python app.py
```

## 📝 Prompt Format

The model expects a structured prompt to enable subtask decomposition and goal-image generation.

### Standard Template

```text
Robot Arm Type: {robot arm type}. Instruction: {task instruction}. Finish the task with {n} steps.
```

Supported robot arm types:

- Songling Aloha
- Songling Aloha Multi View
- Widow X
- Google Everyday
- AgiBot Dual-Arm
- xArm

Example prompt:

```text
Robot Arm Type: Songling Aloha Multi View. Instruction: put the apple on the plate. Finish the task with 2 steps.
```
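The template above can also be filled programmatically. A minimal sketch (the `build_prompt` function and its validation are illustrative, not part of this repository):

```python
# Supported robot arm types, copied from the list above.
SUPPORTED_ARM_TYPES = [
    "Songling Aloha",
    "Songling Aloha Multi View",
    "Widow X",
    "Google Everyday",
    "AgiBot Dual-Arm",
    "xArm",
]

def build_prompt(arm_type: str, instruction: str, n_steps: int) -> str:
    """Fill the standard VISTA prompt template."""
    if arm_type not in SUPPORTED_ARM_TYPES:
        raise ValueError(f"Unsupported robot arm type: {arm_type!r}")
    return (
        f"Robot Arm Type: {arm_type}. "
        f"Instruction: {instruction}. "
        f"Finish the task with {n_steps} steps."
    )

print(build_prompt("Songling Aloha Multi View", "put the apple on the plate", 2))
# → Robot Arm Type: Songling Aloha Multi View. Instruction: put the apple on the plate. Finish the task with 2 steps.
```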

### Manual Intervention in Subtask Planning

If the automatically generated subtask plan is suboptimal, you can specify the subtasks manually by entering them in the "Manual subtask input" field.

Manual subtask input format:

```text
Step 1: pick the apple with the right arm. Step 2: place the apple on the plate using the right hand.
```

This lets users refine the model's hierarchical plan to obtain the desired goal-image generation.
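If you assemble the manual subtask string in code, the expected shape can be produced as follows (a sketch; `format_subtasks` is a hypothetical helper, not a repository API):

```python
def format_subtasks(subtasks: list[str]) -> str:
    """Join free-form subtask descriptions into the 'Step k: ...' format
    expected by the manual subtask input field."""
    return " ".join(
        f"Step {i}: {task.rstrip('.')}." for i, task in enumerate(subtasks, start=1)
    )

print(format_subtasks([
    "pick the apple with the right arm",
    "place the apple on the plate using the right hand",
]))
# → Step 1: pick the apple with the right arm. Step 2: place the apple on the plate using the right hand.
```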

### Multi-View Image Upload Convention

When using a multi-view setting (e.g., Songling Aloha Multi View), the initial observation images must be uploaded in the following fixed order:

1. Head camera
2. Left wrist camera
3. Right wrist camera

Maintaining this order is required for correct multi-view conditioning of the world model.
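One way to avoid ordering mistakes in your own scripts is to keep the views in a dict keyed by camera name and flatten them in the fixed order. A sketch (the camera keys and helper are illustrative assumptions, not a repository API):

```python
# Fixed upload order required for multi-view conditioning.
CAMERA_ORDER = ["head", "left_wrist", "right_wrist"]

def order_views(images: dict[str, str]) -> list[str]:
    """Return observation images in the fixed head / left-wrist / right-wrist order."""
    missing = [cam for cam in CAMERA_ORDER if cam not in images]
    if missing:
        raise KeyError(f"Missing camera views: {missing}")
    return [images[cam] for cam in CAMERA_ORDER]

views = order_views({
    "right_wrist": "right.png",
    "head": "head.png",
    "left_wrist": "left.png",
})
print(views)  # → ['head.png', 'left.png', 'right.png']
```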
