ESI-Bench:

Towards Embodied Spatial Intelligence that Closes the Perception–Action Loop

Yining Hong*1, Jiageng Liu*2, Han Yin1, Manling Li3, Leonidas Guibas1, Fei-Fei Li1, Jiajun Wu1, Yejin Choi1
1Stanford University   2UCLA   3Northwestern University


Abstract

Spatial intelligence unfolds through a perception–action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen — occlusion, dynamics, containment, and functionality — beyond the reach of passive sensing. We take a step beyond prior formulations of spatial intelligence, which often emphasize passive perception or assume access to oracle observations, by recasting the observer as an actor. We introduce ESI-Bench, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy — perception, locomotion, and manipulation — and how to act to answer questions that cannot be resolved from passive observation alone. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instruction, while passive multi-view adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness, and their coupling drives cascading failures where bad actions produce bad views which produce worse actions. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect reconstruction proves more harmful than 2D baselines by actively distorting spatial relations. Human studies further reveal that, unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.

ESI-Bench teaser: 10 task categories and 29 subcategories

ESI-Bench is a comprehensive benchmark for embodied spatial intelligence, spanning 10 task categories and 29 subcategories organized around Spelke's four core knowledge systems [Spelke & Kinzler, 2007]: object representation, layout and geometry, number representation, and agents and goal-directed actions.

10 Task Categories
29 Subcategories
3,081 Task Instances

Task Taxonomy

Each category targets a distinct spatial faculty structurally inaccessible to passive sensing. Across all categories, the correct answer emerges not from any single image but from the agent's capacity to act selectively and reason over the result.

Manipulation reveals containment capacity hidden from view.

Rigid Containment

Plan placement of multiple objects across multiple containers.

Liquid Volume

Compare liquid-holding capacity across containers.

Deformable Fitting

Decide whether a deformable container conforms to an object.

Predict motion and stability given shape, mass, and geometry.

Inclined Plane

Predict object motion and stability on slopes.

Stacking & Stability

Determine whether objects stack or balance given shape, mass, and geometry.

Active repositioning to disambiguate mirror vs. real-world content.

Reflection Authoring

Distinguish real objects from mirror reflections.

Spatial Relations

Infer relations across mirror and real-world views.

Correspondence

Identify which objects appear in the mirror given the real scene.

Repositioning to resolve viewpoint-dependent phenomena.

Partial Occlusion

Reason about objects hidden behind other scene elements.

View Hallucination

Detect objects whose visibility changes critically with viewing angle.

Material Transparency

Reason about objects seen through transparent surfaces.

Locomotion to overcome forced-perspective distortions.

Dimensional Size

Compare relative sizes of objects across vantage points.

Spatial Distance

Compare relative distances with respect to a reference object.

Counting under occlusion, segmentation, and ambiguity.

Counting w/ Occlusion

Count objects partially obscured by other scene elements.

Spatial Segmentation

Count objects separated across distinct spatial regions.

Category Ambiguity

Count visually similar objects requiring fine-grained distinction.

Merged Observation

Count groups that appear visually merged from a single view.

Illumination Variability

Count objects under challenging or non-uniform lighting.

Structural Enclosure

Count objects hidden within enclosed or covered spaces.

Navigation to vantage points that break projective symmetry.

Linear Alignment

Determine whether objects are arranged along a common axis.

Geometric Configuration

Identify the shape formed by a set of objects (e.g., equilateral triangle).

Physical Contact

Detect whether two or more objects are in direct contact.

Multi-step locomotion to construct topological representations.

Topology & Connectivity

Determine whether two locations or regions are mutually reachable.

Traversable Passage

Identify navigable corridors or passageways between regions.

Regional Boundary

Identify and delineate distinct functional spatial regions.

Long-Term Navigation

Plan multi-step navigation toward a distant goal.

Manipulation and interaction to trigger or observe state changes.

Unobserved State Change

Infer scene changes that occurred during an unobserved interval.

Multi-Agent Interaction

Reason about scene dynamics induced by other agents.

Reasoning over ordered actions to determine causal dependencies.

Action Order Inference

Determine the correct procedural ordering of an action sequence.

Three Departures from Prior Spatial Benchmarks

01

From sensing to competence

Agents are evaluated not only on what they can perceive, but on whether they know how to act to perceive it — closing the loop between observation and action.

02

Selective sensing

Agents must determine which observations are worth acquiring, prioritizing task-relevant information over redundant or uninformative inputs.

03

Resolving perceptual mirages

Agents must reason through incomplete or misleading observations to infer hidden spatial structures and physical constraints beyond what is directly observed.
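
The three points above can be pictured as a small loop in which the agent observes, checks whether the question is answerable, and otherwise moves to a new viewpoint. The sketch below is a toy illustration only: the scene table `VISIBLE_FROM`, the `ToyAgent` class, and the viewpoint names are all hypothetical and are not part of ESI-Bench or the OmniGibson API.

```python
from dataclasses import dataclass, field

# Hypothetical toy scene: which objects are visible from each viewpoint.
# The ball sits behind the box, so it is only visible from "behind".
VISIBLE_FROM = {
    "front": {"box"},
    "left": {"box"},
    "behind": {"box", "ball"},
}


@dataclass
class ToyAgent:
    viewpoint: str = "front"
    seen: set = field(default_factory=set)
    visited: list = field(default_factory=list)

    def observe(self):
        # Perception: record the current viewpoint and what it reveals.
        self.visited.append(self.viewpoint)
        self.seen |= VISIBLE_FROM[self.viewpoint]

    def next_viewpoint(self):
        # Action selection: pick the first viewpoint not yet visited.
        for candidate in VISIBLE_FROM:
            if candidate not in self.visited:
                return candidate
        return None


def is_object_present(target, max_steps=3):
    """Alternate observing and moving until the target is seen or the budget ends."""
    agent = ToyAgent()
    for _ in range(max_steps):
        agent.observe()
        if target in agent.seen:
            return True, agent.visited
        nxt = agent.next_viewpoint()
        if nxt is None:
            break
        agent.viewpoint = nxt
    return False, agent.visited


# A passive observer only ever gets the initial "front" view.
passive = "ball" in VISIBLE_FROM["front"]
active, path = is_object_present("ball")
```

In this toy scene the passive check misses the ball, while the loop finds it once the agent reaches the "behind" viewpoint. A more selective policy would rank candidate viewpoints by expected information rather than visiting them in table order.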

Task Distribution Radar Chart

Key Findings

Finding 1

Action blindness dominates perceptual blindness — and their coupling drives failure cascades.

  • Without explicit instruction, active agents spontaneously discover emergent spatial strategies (e.g., move-behind, top-down repositioning, pick-up, pour-out) — driving large gains over passive baselines.
  • For most tasks, perception is not the bottleneck: with the right viewpoint, accuracy rises sharply (e.g., Gemini 3.1 jumps from 14.6% to 95.1% on Partial Occlusion under oracle views).
  • Passive multi-view adds noise, not signal: GPT-5 even drops from 53.9% to 49.1% on Spatial Distance despite consuming far more images.
  • Suboptimal actions produce uninformative views, which trigger worse subsequent actions — a compounding chain unrecoverable within the step budget (active-to-oracle gap reaches 49.7% on Structural Enclosure).
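
To make the size of these effects concrete, the snippet below computes the changes in percentage points from the accuracies quoted in the bullets above. It is plain arithmetic on the reported numbers. The dictionary keys and the helper `point_change` are names chosen here for illustration, and the text does not say which setting each "before" value comes from.

```python
# Accuracies (in %) quoted in the bullets above.
partial_occlusion = {"before": 14.6, "oracle_views": 95.1}       # Gemini 3.1
spatial_distance = {"before": 53.9, "passive_multi_view": 49.1}  # GPT-5


def point_change(before, after):
    """Change in accuracy in percentage points, rounded to one decimal."""
    return round(after - before, 1)


oracle_gain = point_change(
    partial_occlusion["before"], partial_occlusion["oracle_views"]
)
multi_view_change = point_change(
    spatial_distance["before"], spatial_distance["passive_multi_view"]
)

print(f"Oracle views on Partial Occlusion: {oracle_gain:+.1f} pts")
print(f"Passive multi-view on Spatial Distance: {multi_view_change:+.1f} pts")
```

Under these numbers, oracle viewpoints add 80.5 points on Partial Occlusion, while passive multi-view input costs 4.8 points on Spatial Distance.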

Emergent Capabilities: Is the chestnut in the glass? (video panels: Top Down, Move Behind, Pick Up, Pour Down)

Finding 2

3D helps when geometry is perfect — imperfect reconstruction actively misleads.

  • Ground-truth 3D + Gemini reaches 60.4% on Material Transparency vs. 44.0% for 2D Gemini — a +16.4 pt improvement on tasks where 2D projections fundamentally lose depth.
  • VGGT-reconstructed scene graphs degrade performance below 2D baselines: 9.9% vs. 27.5% on Geometric Configuration, as geometric artifacts distort fine-grained spatial relations.
  • Imperfect 3D grounding is not a neutral failure — it amplifies errors by feeding the reasoner a corrupted scene graph.
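
A toy example shows why a small reconstruction error can flip a fine-grained geometric judgment. The sketch below is not the benchmark's pipeline and does not use VGGT. The coordinates, the 0.2 displacement, and the 5% tolerance are assumptions chosen only for illustration.

```python
import math
from itertools import combinations


def side_lengths(points):
    """Sorted pairwise distances between the given 2D points."""
    return sorted(math.dist(a, b) for a, b in combinations(points, 2))


def is_equilateral(points, rel_tol=0.05):
    """True if the longest and shortest sides differ by at most rel_tol (relative)."""
    sides = side_lengths(points)
    return (sides[-1] - sides[0]) / sides[-1] <= rel_tol


# Assumed ground-truth layout: three objects forming a unit equilateral triangle.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]

# Hypothetical reconstruction artifact: one object displaced by 0.2 along x.
reconstructed = [(0.0, 0.0), (1.2, 0.0), (0.5, math.sqrt(3) / 2)]

print(is_equilateral(ground_truth))    # exact geometry passes the check
print(is_equilateral(reconstructed))   # the displaced layout fails it
```

A reasoner that trusts the displaced coordinates answers confidently from wrong geometry, which is the sense in which a corrupted scene graph amplifies errors rather than merely withholding information.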

Finding 3

Models can see — but do not know when they have seen enough.

  • Humans seek viewpoints that falsify their hypothesis; models seek confirmation and tend to repeat motions in the same direction.
  • Models commit prematurely with uniformly high confidence, anchoring to first impressions and ignoring contradictory observations.
  • This is a metacognitive failure, not a perceptual one: neither better perception nor more embodied interaction alone closes the gap.
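
The contrast between anchoring and belief revision can be sketched with a simple Bayesian update. This is an illustrative toy, not the protocol of the human study. The likelihood values below are made up, and `bayes_update` is a helper defined here.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Posterior probability of a hypothesis after one observation."""
    numerator = prior * likelihood_if_true
    return numerator / (numerator + (1 - prior) * likelihood_if_false)


# Hypothetical observations as (P(obs | hypothesis true), P(obs | hypothesis false)).
# The first view supports the hypothesis; the next two contradict it.
observations = [(0.8, 0.2), (0.1, 0.9), (0.1, 0.9)]

belief = 0.5  # uninformative prior
history = []
for lik_true, lik_false in observations:
    belief = bayes_update(belief, lik_true, lik_false)
    history.append(belief)

anchored = history[0]   # committing after the first impression
revised = history[-1]   # belief after weighing the contradicting views

print(f"anchored: {anchored:.2f}, revised: {revised:.2f}")
```

An agent that stops after the first view holds the hypothesis with probability 0.8, while one that keeps looking and updates ends below 0.1. The gap comes from the stopping decision, not from the quality of any single observation.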

Citation

@article{hong2026esibench,
  title     = {{ESI-Bench}: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop},
  author    = {Hong, Yining and Liu, Jiageng and Yin, Han and Li, Manling and Guibas, Leonidas and Li, Fei-Fei and Wu, Jiajun and Choi, Yejin},
  journal   = {arXiv preprint},
  year      = {2026},
  url       = {https://esi-bench.github.io/}
}