
Point-It-Out (PIO)

Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding

PIO Benchmark Overview

arXiv:2509.25794   PIO Data Explorer   PIO Project Website

TL;DR: We move embodied reasoning evaluation from multiple-choice answers (A/B/C/D) to visual grounding, testing whether VLMs truly know where to point in realistic embodied scenarios.


Haotian Xue1,2*, Yunhao Ge2, Yu Zeng2, Zhaoshuo Li2, Ming-Yu Liu2, Yongxin Chen1,2, Jiaojiao Fan2
1Georgia Tech   ·   2NVIDIA

📄 Read the paper on arXiv



Introduction

Most embodied reasoning benchmarks evaluate models via multiple-choice QA (MCQ): the model picks from options like A/B/C/D.
However, real-world agents need to do more than choose an answer — they must ground their reasoning in the scene:

“Where exactly should I act?”
“Which object or location is the right one?”

Point-It-Out (PIO) reframes embodied reasoning as visual grounding in realistic scenarios.
Instead of asking “Which option is correct?”, PIO asks:

  • “Point to the correct object under certain constraints.”
  • “Point to the location that best satisfies a task goal.”
  • “Draw the trajectory that completes the task safely.”

We argue this is a crucial step toward connecting multi-modal language models to the physical world.


Three-Stage Design

PIO consists of three stages, each targeting a different aspect of embodied reasoning:

  • (S1) Object Reference
    Refer to a specific object under different constraints:

    • Location (left/right/behind, etc.)
    • Object part (handle, lid, door, etc.)
    • Appearance attributes
  • (S2) Task-Centric Grounding
    Refer to where to act based on a downstream goal:

    • Recommendation (e.g., I am thirsty, point to something that can help)
    • Affordance (where to grasp / push / open)
    • Next-state prediction (where an object will move or should move)
  • (S3) Visual Trace Prediction
    Predict a continuous trajectory in the image to accomplish a task:

    • Draw a motion trace to open/close objects

PIO Benchmarks

PIO currently contains ~600 human-curated questions:

  • S1: ~230 samples
  • S2: ~270 samples
  • S3: ~100 samples

For S1 / S2:

  • We provide human-annotated ground truth as segmentation masks.

For S3:

  • We provide a VLM-based judging prompt to evaluate the quality of predicted visual traces (and recommend human evaluation for highest reliability).

Loading S1 / S2 Data

import json

with open('data/s1s2.json', 'r') as f:
    s1s2_data = json.load(f)

for sample in s1s2_data:
    polygon = sample['polygon']        # ground-truth seg mask (COCO-style polygon)
    image_path = sample['image_path']  # relative image path
    height, width = sample['height'], sample['width']
    prompt = sample['lang']            # task prompt
    s1_or_s2 = sample['s1_or_s2']      # 's1' or 's2'
    subclass = sample['subclasses']    # subclass label, e.g. "recommendation"
  • The root of image_path is:
data/images_s1s2/
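
To score predictions against these annotations you typically need the polygon as a binary mask. The helper below is a minimal sketch (not part of the benchmark code), assuming polygon follows the standard COCO convention of one or more flat [x1, y1, x2, y2, ...] rings in pixel coordinates; adjust it if your copy of data/s1s2.json stores the field differently.

import numpy as np
from PIL import Image, ImageDraw

def polygon_to_mask(polygon, height, width):
    # Rasterize a COCO-style polygon (assumed format) into a boolean mask.
    mask = Image.new('L', (width, height), 0)
    draw = ImageDraw.Draw(mask)
    rings = polygon if isinstance(polygon[0], (list, tuple)) else [polygon]
    for ring in rings:
        xy = list(zip(ring[0::2], ring[1::2]))  # [(x1, y1), (x2, y2), ...]
        draw.polygon(xy, outline=1, fill=1)
    return np.array(mask, dtype=bool)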

Loading S3 Data

import json

with open('data/s3.json', 'r') as f:
    s3_data = json.load(f)

for sample in s3_data:
    image_path = sample['image_path']  # relative image path
    prompt = sample['lang']            # task prompt
  • The root of image_path is:
data/images_s3/
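
S3 predictions are trajectories rather than masks, so a quick visual check is often the fastest sanity test. The snippet below is a hedged sketch, assuming you have already parsed a model's output into a list of (x, y) pixel waypoints; draw_trace and its arguments are illustrative, not part of the repo.

import os
from PIL import Image, ImageDraw

def draw_trace(image_root, image_path, points, out_path='trace_vis.png'):
    # Overlay a parsed trajectory (list of (x, y) pixel points) on the image.
    img = Image.open(os.path.join(image_root, image_path)).convert('RGB')
    draw = ImageDraw.Draw(img)
    if len(points) > 1:
        draw.line(points, fill=(255, 0, 0), width=4)  # connect the waypoints
    for x, y in points:
        draw.ellipse([x - 4, y - 4, x + 4, y + 4], fill=(0, 255, 0))
    img.save(out_path)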

Demo Inference with Gemini-2.5-flash

We provide ready-to-run demo scripts to perform inference on all three stages (S1/S2/S3) using Gemini-2.5-flash (or any other VLM defined in code/vlms/).

S1 / S2 Demo

Simply run:

EXP_NAME="demo"
MAX_CASES=5  # set to -1 to run on all cases

python code/test_s1s2.py \
  --test_models gemini-2.5-flash \
  --max_cases ${MAX_CASES} \
  --exp_name ${EXP_NAME} \
  --which_s both \
  --save_path results \
  --s1s2_path data/s1s2.json

S3 Demo

EXP_NAME="demo"
MAX_CASES=5  # set to -1 to run on all cases

python code/test_s3.py \
  --test_model gemini-2.5-flash \
  --max_cases ${MAX_CASES} \
  --exp_name ${EXP_NAME} \
  --json_path data/s3.json \
  --image_root data/images_s3 \
  --save_path results

After running these scripts, you will find:

  • Visualization: vis.png (or similar) containing:

    • GT segmentation mask / polygon (S1/S2)
    • Model prediction (bbox / points / trajectory)
  • Raw predictions: info.npy files containing model outputs and metadata.
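
If you want to post-process results programmatically, the info.npy files can be loaded with NumPy. Their exact structure depends on the test scripts, so the snippet below only shows a generic, hedged way to inspect such a file (the results/demo/info.npy path is hypothetical):

import numpy as np

info = np.load('results/demo/info.npy', allow_pickle=True)  # hypothetical path
obj = info.item() if info.shape == () else info  # unwrap a 0-d object array
print(type(obj))
if isinstance(obj, dict):
    for key, value in obj.items():
        print(key, type(value))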

Example visualizations are provided for each stage (top row: S1, middle row: S2, bottom row: S3).


Test New Models

The VLM zoo is defined in:

code/vlms/__init__.py

To plug in a new VLM:

  1. Register the model in get_vlm():

    • Add your <model_name> and corresponding <ModelClass>.
  2. Implement your model class under code/vlms/ (a minimal sketch follows this list):

    • Create new_model.py with a class implementing:

      • __call__(...) – used by default for S1/S2.
      • preprocess_image(...) – optional helper for image processing.
      • s3(...) – used for S3 trajectory prediction.
  3. Add prompts:

    • Define your model’s prompt template and set the path via:

      self.question_template = "prompts/your_model_prompt.txt"
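
As a reference, here is a minimal, hypothetical wrapper illustrating the interface above. The method names follow the list; argument names, return formats, and the NotImplementedError placeholders are assumptions for illustration and should be matched to the existing wrappers in code/vlms/:

from PIL import Image

class NewModel:
    def __init__(self):
        self.question_template = "prompts/your_model_prompt.txt"  # prompt template path
        # self.client = ...  # initialize your API client or local model here

    def preprocess_image(self, image_path):
        # Optional helper: load / resize / encode the image as your model expects.
        return Image.open(image_path).convert('RGB')

    def __call__(self, image_path, prompt):
        # S1/S2 entry point: return the raw model response (e.g., a point or box
        # in text form) for the downstream parsing and evaluation code.
        image = self.preprocess_image(image_path)
        raise NotImplementedError("call your VLM here")

    def s3(self, image_path, prompt):
        # S3 entry point: return the predicted visual trace, e.g. a list of
        # (x, y) waypoints, in the format your parsing code expects.
        raise NotImplementedError("call your VLM here")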

We include demo implementations for:

  • GPT series
  • Gemini series
  • MoLMO
  • RoboRefer
  • RoboBrain

along with their prompts in the prompts/ directory.


Evaluation

S3 Evaluation

For S3 (visual trace prediction), we provide an automatic VLM-based evaluation prompt in:

prompts/s3_auto.txt

However, we strongly recommend human evaluation for rigorous assessment, especially for safety- and planning-critical tasks.

S1 / S2 Evaluation

For S1 / S2, we provide an evaluation script:

code/eval_s1s2.py

This script:

  • Loads predictions from results/ (output by test_s1s2.py)
  • Computes IoU-based metrics against ground-truth segmentation
  • Prints summary scores (S1, S2, and subclasses) to the console

You can use it to compare different models on the exact same benchmark.
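
For intuition about what an IoU-style score measures here, the sketch below compares a predicted [x1, y1, x2, y2] bounding box against the rasterized ground-truth mask (e.g., from the polygon_to_mask sketch above). This is an illustration only; the authoritative metric is whatever code/eval_s1s2.py computes, and the box format is an assumption:

import numpy as np

def box_mask_iou(box, gt_mask):
    # IoU between a predicted box (assumed [x1, y1, x2, y2] in pixels) and a
    # boolean ground-truth mask of shape (height, width).
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    pred_mask = np.zeros_like(gt_mask, dtype=bool)
    pred_mask[max(y1, 0):max(y2, 0), max(x1, 0):max(x2, 0)] = True
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 0.0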


Citation

If you find PIO useful, please consider starring the repo and citing the paper 💫

@article{xue2025point,
  title={Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding},
  author={Xue, Haotian and Ge, Yunhao and Zeng, Yu and Li, Zhaoshuo and Liu, Ming-Yu and Chen, Yongxin and Fan, Jiaojiao},
  journal={arXiv preprint arXiv:2509.25794},
  year={2025}
}
