
From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

Tianxu Wang1, Zhuofan Zhang1,2, Ziyu Zhu1,2, Yue Fan1, Jing Xiong1,3, Pengxiang Li1,4, Xiaojian Ma1, Qing Li1, *

*: corresponding author

1 BIGAI, 2 Tsinghua University, 3 Peking University, 4 Beijing Institute of Technology

arXiv Code Data Annotator

Figure 1: Multi-level visual grounding in 3D scenes.

Figure 2: Distinct expression types in Anywhere3D-Bench.

📰 News

  • 🧑‍💻 2025/05/26 Release Human Annotation Interface Demo, supporting four scenes from ScanNet, MultiScan, 3RScan, and ARKitScenes. Click Here to try it out and see the tutorial.
  • 📄 2025/06/04 Paper submitted to arXiv: Anywhere3D Paper
  • 📺 2025/06/13 Release Video Demo: Anywhere3D Video Demo
  • 🤖 2025/07/07 Add more evaluation results on Anywhere3D-Bench, including the state-of-the-art visual thinking models Google Gemini-2.5-Pro and OpenAI o3: check the updated results on our project page Anywhere3D Project Page
  • 🗂️ 2025/09/02 Release the Anywhere3D_v2 dataset Anywhere3d_v2, containing 2,886 referring expression-3D bounding box pairs. We increased the number of annotations for the two most challenging grounding levels in the previous dataset version Anywhere3D: the space level and the part level. Specifically, 106 annotations were added at the space level and 148 at the part level, with a particular focus on expanding the most difficult movement tasks within the part level.
  • 📊 2025/09/07 Release evaluation results on the Anywhere3D_v2 dataset on our project page Anywhere3D Project Page
  • 🎉 2025/09/19 Our paper Anywhere3D has been accepted to the NeurIPS 2025 Datasets and Benchmarks Track!

🧠 Abstract

TL;DR We introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2.6K referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts.

3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,632 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require more comprehensive spatial reasoning, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing model, OpenAI o4-mini, achieves only 23.57% accuracy on space-level tasks and 33.94% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.

📦 Anywhere3D-Bench

We release our dataset on Hugging Face.
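
A minimal loading sketch with the Hugging Face datasets library is shown below; the repository id, split name, and field comments are assumptions, so check the dataset card for the actual values.

# Hypothetical loading sketch -- the repo id, split, and field names are assumptions.
from datasets import load_dataset

ds = load_dataset("anywhere-3d/Anywhere3D")  # placeholder repo id; see the dataset card
sample = ds["test"][0]                       # split name may differ
# Each sample is expected to pair a referring expression with a 3D bounding box,
# e.g. a scene id, the expression text, the grounding level (area / space /
# object / part), and the box center and size.
print(sample)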

🛠️ Implementation

To reproduce the evaluation results on Anywhere3D-Bench, first clone the repository and install the dependencies:

git clone https://github.com/anywhere-3d/Anywhere3D.git
cd Anywhere3D
pip install -r requirements.txt

Generate LLM predictions on Anywhere3D-Bench (please add your own API key for the corresponding models: GPT-4.1, o4-mini, Qwen, DeepSeek, ...); a rough sketch of the underlying request is shown after the commands:

cd LLM
python generate_predictions.py
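
For reference, here is a hedged sketch of the kind of request this step issues: a text-only prompt combining a scene description with a referring expression, sent through an OpenAI-compatible client. The prompt wording, placeholder scene description, and answer format are illustrative assumptions rather than the exact ones used in LLM/generate_predictions.py.

# Illustrative sketch only -- prompt wording and output format are assumptions.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # supply your own API key

scene_description = "..."  # textual scene description (objects, positions, sizes)
referring_expression = "the empty space right in front of the sofa, large enough to sit in"

prompt = (
    f"Scene:\n{scene_description}\n\n"
    f"Expression: {referring_expression}\n"
    "Answer with the target 3D bounding box as center = (x, y, z) and size = (dx, dy, dz) in meters."
)

response = client.chat.completions.create(
    model="gpt-4.1",  # or o4-mini, Qwen, DeepSeek via their respective endpoints
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)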

Evaluate LLM predictions on Anywhere3D-Bench (extract the center coordinates and sizes of the predicted bounding boxes first, then evaluate); a sketch of the parsing step is shown after the commands:

cd LLM
python process_bbx_with_regular_expression.py
python process_bbx_with_LLM.py
python evaluate_predictions.py
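
As a rough illustration of the parsing step, the snippet below pulls the center and size triples out of a free-form answer with a regular expression; the expected answer format is an assumption, and answers the regex cannot handle are what the LLM-based extraction pass is meant to catch.

# Illustrative sketch only -- the answer format is an assumed convention.
import re

answer = "center = (1.20, 0.85, 0.45), size = (0.60, 0.40, 0.90)"

number = r"[-+]?\d*\.?\d+"
triple = rf"\(\s*({number})\s*,\s*({number})\s*,\s*({number})\s*\)"

center_match = re.search(rf"center\s*=\s*{triple}", answer)
size_match = re.search(rf"size\s*=\s*{triple}", answer)

if center_match and size_match:
    center = [float(v) for v in center_match.groups()]
    size = [float(v) for v in size_match.groups()]
    print(center, size)  # [1.2, 0.85, 0.45] [0.6, 0.4, 0.9]
else:
    # Answers the regex cannot parse are left to the LLM-based pass
    # (process_bbx_with_LLM.py) before evaluation.
    print("could not parse answer")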

Generate VLM predictions on Anywhere3D-Bench. Please follow GPT4Scene to generate the bird's-eye view and video frames of each scene first and save them in the corresponding folders, i.e. ./3RScan_gpt4scene_data, ./arkitscene_gpt4scene_data, ./multiscan_gpt4scene_data, ./scannet_gpt4scene_data, and add your own API key for the corresponding models: GPT-4.1, o4-mini, Qwen, InternVL3. A rough sketch of how such a multimodal request is assembled is shown after the commands:

cd VLM
python generate_predictions.py
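
A hedged sketch of how such a multimodal request can be assembled: the bird's-eye view and sampled frames are attached as base64 images alongside the referring expression. The folder layout, file names, prompt, and model name are assumptions; match them to your own GPT4Scene outputs and provider.

# Illustrative sketch only -- folder layout, file names, and prompt are assumptions.
import base64
import glob
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # supply your own API key

def to_image_part(path):
    # Encode an image file as a base64 data URL for the chat completions API.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

scene_dir = "./scannet_gpt4scene_data/scene0000_00"  # hypothetical scene folder
parts = [to_image_part(os.path.join(scene_dir, "bev.png"))]  # hypothetical BEV file name
parts += [to_image_part(p) for p in sorted(glob.glob(os.path.join(scene_dir, "frame_*.png")))]
parts.append({
    "type": "text",
    "text": "Expression: the unoccupied space between the desk and the bookshelf\n"
            "Return the target 3D bounding box as center = (x, y, z) and size = (dx, dy, dz) in meters.",
})

response = client.chat.completions.create(
    model="gpt-4.1",  # or o4-mini, Qwen, InternVL3 via their respective endpoints
    messages=[{"role": "user", "content": parts}],
)
print(response.choices[0].message.content)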

Evaluate VLM predictions on Anywhere3D-Bench (extract the center coordinates and sizes of the predicted bounding boxes first, then evaluate); a minimal sketch of the IoU-based metric is shown after the commands:

cd VLM
python process_bbx_with_regular_expression.py
python process_bbx_with_LLM.py
python evaluate_predictions.py
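
For intuition, here is a minimal sketch of the metric: axis-aligned 3D IoU between predicted and ground-truth boxes, with a prediction counted as correct above a threshold. The 0.25 threshold and the toy box pair are illustrative assumptions; use evaluate_predictions.py and the threshold defined in the paper for the official numbers.

# Illustrative sketch only -- the IoU threshold and toy data are assumptions.
import numpy as np

def iou_3d(center_a, size_a, center_b, size_b):
    # Axis-aligned 3D IoU of two boxes given as (center, size).
    a_min = np.asarray(center_a, float) - np.asarray(size_a, float) / 2
    a_max = np.asarray(center_a, float) + np.asarray(size_a, float) / 2
    b_min = np.asarray(center_b, float) - np.asarray(size_b, float) / 2
    b_max = np.asarray(center_b, float) + np.asarray(size_b, float) / 2
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None).prod()
    union = np.prod(size_a) + np.prod(size_b) - inter
    return inter / union if union > 0 else 0.0

# Toy (prediction, ground truth) pair; the real evaluation iterates over the benchmark.
pairs = [(([1.2, 0.8, 0.5], [0.6, 0.4, 0.9]), ([1.1, 0.8, 0.5], [0.6, 0.5, 0.9]))]
threshold = 0.25  # illustrative IoU threshold
accuracy = np.mean([iou_3d(*pred, *gt) >= threshold for pred, gt in pairs])
print(f"Acc@{threshold}: {accuracy:.2%}")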

✅ To Do

  • Release Code: Human Annotation Tool
  • Release Code: Caption Generation with Qwen2.5-VL, Object Orientation Generation with Orient Anything

🙏 Acknowledgements

We would especially like to thank ScanRefer for providing an excellent 3D annotation interface, which greatly facilitated the annotation process. We also appreciate the modifications made by SQA3D to the ScanRefer annotation interface. The annotation interface used in Anywhere3D was adapted from their well-designed interfaces. We are deeply grateful for their wonderful design and generous sharing with the community.

We would also like to thank the following open-source projects:

We also wish to thank the numerous inspiring works on 3D visual grounding and spatial intelligence that have informed and motivated our research, though it is difficult to list all of them here.

📖 Citation

If you find this project helpful, please consider citing:

@misc{anywhere3d,
      title={From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes}, 
      author={Tianxu Wang and Zhuofan Zhang and Ziyu Zhu and Yue Fan and Jing Xiong and Pengxiang Li and Xiaojian Ma and Qing Li},
      year={2025},
      eprint={2506.04897},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.04897}, 
}
