
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

Tianxu Wang1, Zhuofan Zhang1,2, Ziyu Zhu1,2, Yue Fan1, Jing Xiong1,3, Pengxiang Li1,4, Xiaojian Ma1, Qing Li1, *

*: corresponding author

1 BIGAI, 2 Tsinghua University, 3 Peking University, 4 Beijing Institute of Technology

arXiv Code Data Annotator

Figure 1: Multi-level visual grounding in 3D scenes.

Figure 2: Distinct expression types in Anywhere3D-Bench.

📰 News

  • 🧑‍💻 2025/05/26 Release Human Annotation Interface Demo, supporting four scenes from ScanNet, MultiScan, 3RScan, and ARKitScenes. Click Here to try it out and see the tutorial.
  • 📄 2025/06/04 Paper submitted to arXiv: Anywhere3D Paper
  • 📺 2025/06/13 Release Video Demo: Anywhere3D Video Demo
  • 🤖 2025/07/07 Add more evaluation results on Anywhere3D-Bench, including the state-of-the-art visual thinking models Google Gemini-2.5-Pro and OpenAI o3: check the updated results on our project page Anywhere3D Project Page
  • 🗂️ 2025/09/02 Release the Anywhere3D_v2 dataset Anywhere3d_v2, containing 2,886 referring expression-3D bounding box pairs. We increased the number of annotations for the two most challenging grounding levels in the previous dataset version Anywhere3D: the space level and the part level. Specifically, 106 annotations were added at the space level and 148 at the part level, with a particular focus on expanding the most difficult movement tasks within the part level.
  • 📊 2025/09/07 Release evaluation results on the Anywhere3D_v2 dataset on our project page Anywhere3D Project Page
  • 🎉 2025/09/19 Our paper Anywhere3D has been accepted to the NeurIPS 2025 Datasets and Benchmarks Track!

🧠 Abstract

TL;DR We introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2.6K referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts.

3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,632 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require more comprehensive spatial reasoning, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing model, OpenAI o4-mini, achieves only 23.57% accuracy on space-level tasks and 33.94% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.

📦 Anywhere3D-Bench

We release our dataset on Hugging Face.
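
A minimal loading sketch with the Hugging Face datasets library is shown below; the repository id, split name, and field comments are assumptions, so check the dataset card for the actual values.

# Hypothetical loading sketch -- the repo id, split, and field names are assumptions.
from datasets import load_dataset

ds = load_dataset("anywhere-3d/Anywhere3D")  # placeholder repo id; see the dataset card
sample = ds["test"][0]                       # split name may differ
# Each sample is expected to pair a referring expression with a 3D bounding box,
# e.g. a scene id, the expression text, the grounding level (area / space /
# object / part), and the box center and size.
print(sample)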

🛠️ Implementation

To reproduce the evaluation results on Anywhere3D-Bench, first clone the repository and install the dependencies:

git clone https://github.com/anywhere-3d/Anywhere3D.git
cd Anywhere3D
pip install -r requirements.txt

Generate LLM predictions on Anywhere3D-Bench (please add your own API key for the corresponding models: GPT-4.1, o4-mini, Qwen, DeepSeek, ...); a rough sketch of the underlying request is shown after the commands:

cd LLM
python generate_predictions.py
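
For reference, here is a hedged sketch of the kind of request this step issues: a text-only prompt combining a scene description with a referring expression, sent through an OpenAI-compatible client. The prompt wording, placeholder scene description, and answer format are illustrative assumptions rather than the exact ones used in LLM/generate_predictions.py.

# Illustrative sketch only -- prompt wording and output format are assumptions.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # supply your own API key

scene_description = "..."  # textual scene description (objects, positions, sizes)
referring_expression = "the empty space right in front of the sofa, large enough to sit in"

prompt = (
    f"Scene:\n{scene_description}\n\n"
    f"Expression: {referring_expression}\n"
    "Answer with the target 3D bounding box as center = (x, y, z) and size = (dx, dy, dz) in meters."
)

response = client.chat.completions.create(
    model="gpt-4.1",  # or o4-mini, Qwen, DeepSeek via their respective endpoints
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)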

Evaluate LLM predictions on Anywhere3D-Bench (extract the center coordinates and sizes of the predicted bounding boxes first, then evaluate); a sketch of the parsing step is shown after the commands:

cd LLM
python process_bbx_with_regular_expression.py
python process_bbx_with_LLM.py
python evaluate_predictions.py
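
As a rough illustration of the parsing step, the snippet below pulls the center and size triples out of a free-form answer with a regular expression; the expected answer format is an assumption, and answers the regex cannot handle are what the LLM-based extraction pass is meant to catch.

# Illustrative sketch only -- the answer format is an assumed convention.
import re

answer = "center = (1.20, 0.85, 0.45), size = (0.60, 0.40, 0.90)"

number = r"[-+]?\d*\.?\d+"
triple = rf"\(\s*({number})\s*,\s*({number})\s*,\s*({number})\s*\)"

center_match = re.search(rf"center\s*=\s*{triple}", answer)
size_match = re.search(rf"size\s*=\s*{triple}", answer)

if center_match and size_match:
    center = [float(v) for v in center_match.groups()]
    size = [float(v) for v in size_match.groups()]
    print(center, size)  # [1.2, 0.85, 0.45] [0.6, 0.4, 0.9]
else:
    # Answers the regex cannot parse are left to the LLM-based pass
    # (process_bbx_with_LLM.py) before evaluation.
    print("could not parse answer")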

Generate VLM predictions on Anywhere3D-Bench. Please follow GPT4Scene to generate the bird's-eye view and video frames of each scene first and save them in the corresponding folders, i.e. ./3RScan_gpt4scene_data, ./arkitscene_gpt4scene_data, ./multiscan_gpt4scene_data, ./scannet_gpt4scene_data, and add your own API key for the corresponding models: GPT-4.1, o4-mini, Qwen, InternVL3. A rough sketch of how such a multimodal request is assembled is shown after the commands:

cd VLM
python generate_predictions.py
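
A hedged sketch of how such a multimodal request can be assembled: the bird's-eye view and sampled frames are attached as base64 images alongside the referring expression. The folder layout, file names, prompt, and model name are assumptions; match them to your own GPT4Scene outputs and provider.

# Illustrative sketch only -- folder layout, file names, and prompt are assumptions.
import base64
import glob
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # supply your own API key

def to_image_part(path):
    # Encode an image file as a base64 data URL for the chat completions API.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

scene_dir = "./scannet_gpt4scene_data/scene0000_00"  # hypothetical scene folder
parts = [to_image_part(os.path.join(scene_dir, "bev.png"))]  # hypothetical BEV file name
parts += [to_image_part(p) for p in sorted(glob.glob(os.path.join(scene_dir, "frame_*.png")))]
parts.append({
    "type": "text",
    "text": "Expression: the unoccupied space between the desk and the bookshelf\n"
            "Return the target 3D bounding box as center = (x, y, z) and size = (dx, dy, dz) in meters.",
})

response = client.chat.completions.create(
    model="gpt-4.1",  # or o4-mini, Qwen, InternVL3 via their respective endpoints
    messages=[{"role": "user", "content": parts}],
)
print(response.choices[0].message.content)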

Evaluate VLM predictions on Anywhere3D-Bench (extract the center coordinates and sizes of the predicted bounding boxes first, then evaluate); a minimal sketch of the IoU-based metric is shown after the commands:

cd VLM
python process_bbx_with_regular_expression.py
python process_bbx_with_LLM.py
python evaluate_predictions.py
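
For intuition, here is a minimal sketch of the metric: axis-aligned 3D IoU between predicted and ground-truth boxes, with a prediction counted as correct above a threshold. The 0.25 threshold and the toy box pair are illustrative assumptions; use evaluate_predictions.py and the threshold defined in the paper for the official numbers.

# Illustrative sketch only -- the IoU threshold and toy data are assumptions.
import numpy as np

def iou_3d(center_a, size_a, center_b, size_b):
    # Axis-aligned 3D IoU of two boxes given as (center, size).
    a_min = np.asarray(center_a, float) - np.asarray(size_a, float) / 2
    a_max = np.asarray(center_a, float) + np.asarray(size_a, float) / 2
    b_min = np.asarray(center_b, float) - np.asarray(size_b, float) / 2
    b_max = np.asarray(center_b, float) + np.asarray(size_b, float) / 2
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None).prod()
    union = np.prod(size_a) + np.prod(size_b) - inter
    return inter / union if union > 0 else 0.0

# Toy (prediction, ground truth) pair; the real evaluation iterates over the benchmark.
pairs = [(([1.2, 0.8, 0.5], [0.6, 0.4, 0.9]), ([1.1, 0.8, 0.5], [0.6, 0.5, 0.9]))]
threshold = 0.25  # illustrative IoU threshold
accuracy = np.mean([iou_3d(*pred, *gt) >= threshold for pred, gt in pairs])
print(f"Acc@{threshold}: {accuracy:.2%}")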

✅ To Do

  • Release Code: Human Annotation Tool
  • Release Code: Caption Generation with Qwen2.5-VL, Object Orientation Generation with Orient Anything

🙏 Acknowledgements

We would especially like to thank ScanRefer for providing an excellent 3D annotation interface, which greatly facilitated the annotation process. We also appreciate the modifications made by SQA3D to the ScanRefer annotation interface. The annotation interface used in Anywhere3D was adapted from their well-designed interfaces. We are deeply grateful for their wonderful design and generous sharing with the community.

We would also like to thank the following open-source projects:

We also wish to thank the numerous inspiring works on 3D visual grounding and spatial intelligence that have informed and motivated our research, though it is difficult to list all of them here.

📖 Citation

If you find this project helpful, please consider citing:

@misc{anywhere3d,
      title={From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes}, 
      author={Tianxu Wang and Zhuofan Zhang and Ziyu Zhu and Yue Fan and Jing Xiong and Pengxiang Li and Xiaojian Ma and Qing Li},
      year={2025},
      eprint={2506.04897},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.04897}, 
}
