From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

1BIGAI 2Tsinghua University 3Peking University 4Beijing Institute of Technology

TL;DR We introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2.6K referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts.




Abstract

3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,632 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require more comprehensive spatial reasoning, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing models, Google Gemini-2.5-Pro and OpenAI o3, achieve only around 30% accuracy on space-level tasks and slightly above 40% on part-level tasks, significantly lower than their performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.


Anywhere3D Dataset

Anywhere3D-Bench involves multi-level 3D visual grounding (part, object, area, space) with distinct expression types.
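
The exact data format is defined by the released benchmark files; purely as an illustration, each item pairs a natural-language referring expression and its grounding level with a target 3D bounding box in a scene. The field names and values in the minimal sketch below are hypothetical, not the official schema.

# Hypothetical Anywhere3D-Bench item, for illustration only.
# Field names and values are assumptions, not the official schema.
example_item = {
    "scene_id": "scene0000_00",    # assumed ScanNet-style scene identifier
    "level": "space",              # one of: "part", "object", "area", "space"
    "expression": (                # hypothetical referring expression
        "the empty floor space between the bed and the desk, "
        "large enough to place a chair"
    ),
    "bbox": [1.2, 0.5, 0.4,        # assumed box center (x, y, z), in meters
             0.6, 0.6, 0.8],       # assumed box size (dx, dy, dz), in meters
}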

Annotation Tool Demo

In the Annotation UI, you can explore up to 40 data items by clicking the Load Current ID Annotations button at the bottom of the page. For a detailed guide, use the Tutorial button in the top-right corner.

Examples from the Anywhere3D Benchmark

Here we present a few examples from the Anywhere3D dataset via a data explorer.

To use the data explorer, first select a scene from the selection bar. The corresponding visual grounding examples will be shown below. Click on a referring expression to visualize its ground-truth bounding box in the scene. Best viewed on a monitor.
Controls: Click + Drag = Rotate; Ctrl + Drag = Translate; Scroll Up/Down = Zoom In/Out


Quantitative Results on Anywhere3D-Bench

Results are reported as Acc@0.25IoU on Anywhere3D-Bench.
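
For reference, Acc@0.25IoU counts a prediction as correct when the 3D IoU between the predicted and ground-truth bounding boxes is at least 0.25. Below is a minimal sketch of this metric, assuming axis-aligned boxes parameterized as center plus size; the official evaluation script may use a different box parameterization.

import numpy as np

def aabb_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each given as (cx, cy, cz, dx, dy, dz)."""
    a_ctr, a_size = np.array(box_a[:3]), np.array(box_a[3:])
    b_ctr, b_size = np.array(box_b[:3]), np.array(box_b[3:])
    # Per-axis overlap, clipped at zero when the boxes do not intersect.
    overlap = np.clip(
        np.minimum(a_ctr + a_size / 2, b_ctr + b_size / 2)
        - np.maximum(a_ctr - a_size / 2, b_ctr - b_size / 2),
        0.0, None,
    )
    inter = overlap.prod()
    union = a_size.prod() + b_size.prod() - inter
    return inter / max(union, 1e-8)

def acc_at_iou(pred_boxes, gt_boxes, threshold=0.25):
    """Fraction of predictions whose IoU with the ground-truth box reaches the threshold."""
    hits = [aabb_iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / max(len(hits), 1)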

Gemini-2.5-Pro's Error Breakdown on Anywhere3D-Bench

Qualitative Results of Gemini-2.5-Pro on Anywhere3D-Bench

Here we present a few qualitative results from Anywhere3D-Bench together with Gemini-2.5-Pro's reasoning processes.
Green bounding boxes represent the ground truth, while red boxes represent Gemini-2.5-Pro's predictions.
Errors in Gemini-2.5-Pro's reasoning process are highlighted in bold.

Non-thinking Model vs. Thinking Model on Anywhere3D-Bench

Here we present a comparison between the best-performing non-thinking model (GPT-4.1) and the best-performing thinking model (Gemini-2.5-Pro) on Anywhere3D-Bench.
Green bounding boxes represent the ground truth, while red boxes represent each model's prediction.
Errors in Gemini-2.5-Pro's reasoning process are highlighted in bold.

Acknowledgements

We would especially like to thank ScanRefer for providing an excellent 3D annotation interface, which greatly facilitated the annotation process. We also appreciate the modifications SQA3D made to the ScanRefer annotation interface. The annotation interface used in Anywhere3D was adapted from their well-designed interfaces, and we are deeply grateful for their wonderful design and generous sharing with the community.

We would also like to thank the following open-source projects:

3D Visual Grounding Models: 3D-VisTA, PQ3D, Chat-Scene

We also wish to thank the numerous inspiring works on 3D visual grounding and spatial intelligence that have informed and motivated our research, though it is difficult to list all of them here.

BibTeX

@misc{anywhere3d,
      title={From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes}, 
      author={Tianxu Wang and Zhuofan Zhang and Ziyu Zhu and Yue Fan and Jing Xiong and Pengxiang Li and Xiaojian Ma and Qing Li},
      year={2025},
      eprint={2506.04897},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.04897}, 
}