PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies
PEEK enhances the zero-shot generalization of any RGB-input manipulation policy by showing it where to focus and what to do: a VLM predicts paths and masking points, which are drawn onto the policy's input images in a closed loop.
This repo includes:
- PEEK VLM inference and VLM data-labeling code
- ACT+PEEK pre-trained models, inference (on WidowX), and training code
- Pi-0+PEEK pre-trained models, inference (on WidowX), and training code
- 3D-DA simulation code
```bash
git clone --recursive https://github.com/peek-robot/peek.git  # downloads all submodules

# or, if you already cloned the repo without --recursive:
git submodule update --init --recursive
```
- Follow the instructions in peek_vlm to install the VLM.
- For ACT baseline training/inference, follow the instructions in lerobot to install LeRobot.
- For Pi-0, follow the instructions in openpi to install OpenPI.
- For data labeling, follow the instructions in point_tracking to install the point tracking code.
- For 3D-DA simulation, follow the instructions in 3dda to install the 3dda code.
PEEK works by:
- VLM Fine-tuning: Fine-tune a pre-trained VLM on automatically labeled robotics data to predict paths and masking points
- Policy Enhancement: Use the VLM to guide any RGB-input policy during both training and inference
- Zero-shot Generalization: Enable policies to generalize to new tasks, objects, and environments
- 🎯 Path Prediction: VLM predicts where the robot should move
- 🎭 Masking Points: VLM identifies relevant areas to focus on
- 🔄 Closed-loop Guidance: Real-time VLM predictions during policy execution (see the sketch after this list)
- 🧩 Policy Agnostic: Works with any RGB-input manipulation policy
- 🌍 Zero-shot Generalization: Tested on 535 real-world evaluations across 17 task variations
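To make the closed loop concrete, below is a minimal sketch of what "drawing" the VLM's predictions onto a policy's input image can look like, using OpenCV. The overlay choices (a polyline for the path, masking out everything far from the predicted points) and all function names here are illustrative assumptions, not PEEK's actual rendering code:

```python
import cv2
import numpy as np

def apply_peek_overlay(image: np.ndarray, path: np.ndarray, mask_points: np.ndarray,
                       keep_radius: int = 40) -> np.ndarray:
    """Draw a predicted 2D path and mask out regions far from the predicted points.

    image:       (H, W, 3) uint8 frame from the robot camera.
    path:        (N, 2) pixel waypoints predicted by the VLM.
    mask_points: (M, 2) pixel points marking task-relevant regions.
    """
    # Keep only the image content near the VLM's masking points.
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for x, y in mask_points:
        cv2.circle(mask, (int(x), int(y)), keep_radius, 255, thickness=-1)
    out = cv2.bitwise_and(image, image, mask=mask)
    # Draw the predicted end-effector path on top.
    cv2.polylines(out, [path.astype(np.int32)], isClosed=False, color=(0, 255, 0), thickness=2)
    return out

# Closed loop: refresh the VLM's predictions and re-draw at each control step.
# obs = camera.read(); labels = vlm.predict(obs, instruction)          # hypothetical calls
# action = policy(apply_peek_overlay(obs, labels["path"], labels["mask_points"]))
```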
```
peek/
├── lerobot/         # LeRobot (for ACT)
├── openpi/          # OpenPI (for Pi-0)
├── 3dda/            # 3D-DA simulation code
├── point_tracking/  # data labeling code
└── peek_vlm/        # PEEK-VLM wrapper
```
PEEK significantly improves policy performance across various scenarios:
- Semantic Generalization: Handles novel objects and instructions
- Visual Clutter: Robust performance in cluttered environments
We use the point tracking code to label data. To download the raw OXE or BRIDGE_v2 datasets used for PEEK VLM training, or to try the data annotation pipeline on your own dataset, see the point_tracking README for instructions and examples.
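For intuition about what the labeling produces (a simplified stand-in, not the actual point_tracking pipeline; the array layout and function name are hypothetical), a future-path label for a frame can be derived by subsampling the tracked 2D gripper positions over the rest of the episode:

```python
import numpy as np

def path_label(gripper_xy: np.ndarray, t: int, num_points: int = 8) -> np.ndarray:
    """Subsample the remaining tracked gripper positions into a fixed-length path.

    gripper_xy: (T, 2) per-frame gripper pixel coordinates, e.g. from a point
                tracker (hypothetical input format).
    t:          current frame index.
    Returns a (num_points, 2) array of waypoints from frame t to the episode end.
    """
    future = gripper_xy[t:]
    # Evenly spaced indices over the remaining trajectory, keeping both endpoints.
    idx = np.linspace(0, len(future) - 1, num_points).round().astype(int)
    return future[idx]

# Example on a synthetic 50-frame straight-line trajectory.
traj = np.stack([np.linspace(10, 200, 50), np.linspace(220, 40, 50)], axis=1)
print(path_label(traj, t=5))
```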
We provide VLM model checkpoints and the VQA dataset used for the experiments in the paper. To train your own VLM, follow the official VILA SFT instructions.
We provide an example of using PEEK's VLM to label data in the peek_vlm folder.
We provide inference examples (Gradio, server/client), as well as an example of batched VLM labeling on BRIDGE data, with instructions in the README.
Alternatively, if you just need the PEEK VLM path/mask labels for BRIDGE_v2, download here.
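For the server/client mode mentioned above, a client can be as simple as the sketch below. It assumes the VLM is served over HTTP and returns path/mask points as JSON; the endpoint, payload fields, and response schema are hypothetical, so see the peek_vlm README for the actual interface:

```python
import base64
import requests

def query_vlm(server_url: str, image_path: str, instruction: str) -> dict:
    """Send one image + instruction to a PEEK VLM server (hypothetical schema)."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {"image": img_b64, "instruction": instruction}
    resp = requests.post(f"{server_url}/predict", json=payload, timeout=60)
    resp.raise_for_status()
    # Expected (hypothetical) response:
    # {"path": [[x, y], ...], "mask_points": [[x, y], ...]}
    return resp.json()

labels = query_vlm("http://localhost:8000", "frame.png", "put the carrot on the plate")
print(labels["path"], labels["mask_points"])
```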
We provide an example dataset conversion script for PEEK-VLM-generated labels on BRIDGE_v2, demonstrating how to convert VLM-labeled data into a LeRobot dataset for upload and training with ACT, Pi-0, or any policy/codebase that supports LeRobot datasets. See: openpi/examples/bridge/convert_bridge_data_to_lerobot.py.
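For orientation, the core of such a conversion typically looks like the sketch below. This is a simplified stand-in, not the actual convert_bridge_data_to_lerobot.py: the feature spec and the episode iterator are assumptions, and LeRobotDataset method signatures vary across LeRobot versions:

```python
import numpy as np
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Feature spec is an assumption; match it to your observation/action shapes.
features = {
    "observation.images.base": {"dtype": "video", "shape": (256, 256, 3), "names": ["h", "w", "c"]},
    "observation.state": {"dtype": "float32", "shape": (7,), "names": None},
    "action": {"dtype": "float32", "shape": (7,), "names": None},
}
dataset = LeRobotDataset.create(repo_id="your-user/bridge_peek", fps=5, features=features)

for episode in load_vlm_labeled_episodes():  # hypothetical iterator over VLM-labeled BRIDGE episodes
    for frame in episode["frames"]:
        dataset.add_frame({
            "observation.images.base": frame["image_with_peek_overlay"],  # path + mask drawn in
            "observation.state": frame["state"].astype(np.float32),
            "action": frame["action"].astype(np.float32),
        })
    # Depending on your LeRobot version, the task string goes here or in add_frame.
    dataset.save_episode(task=episode["instruction"])
```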
We include policy checkpoints for ACT+PEEK and Pi-0+PEEK trained on BRIDGE_v2 (along with standard ACT and Pi-0 checkpoints). We also include training/inference code for the experiments in the paper:
- LeRobot repo for ACT training/inference.
- OpenPI repo for Pi-0 training/inference.
- 3D-DA repo for 3D-DA sim training (for sim2real).
```bibtex
@article{zhang2025peek,
  title={PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies},
  author={Jesse Zhang and Marius Memmel and Kevin Kim and Dieter Fox and Jesse Thomason and Fabio Ramos and Erdem Bıyık and Abhishek Gupta and Anqi Li},
  journal={arXiv preprint arXiv:2509.18282},
  year={2025},
}
```
- Built on top of VILA-1.5
