From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

1BIGAI 2Tsinghua University 3Peking University 4Beijing Institute of Technology

TL;DR We introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2.6K referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts.




Abstract

3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,632 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require more comprehensive spatial reasoning, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing models, Google Gemini-2.5-Pro and OpenAI o3, achieve only around 30% accuracy on space-level tasks and slightly above 40% on part-level tasks, significantly lower than their performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.


Anywhere3D Dataset

Anywhere3D-Bench involves multi-level 3D visual grounding (part, object, area, space) with distinct expression types.
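
The exact data format is defined by the released benchmark files; purely as an illustration, each item pairs a natural-language referring expression and its grounding level with a target 3D bounding box in a scene. The field names and values in the minimal sketch below are hypothetical, not the official schema.

# Hypothetical Anywhere3D-Bench item, for illustration only.
# Field names and values are assumptions, not the official schema.
example_item = {
    "scene_id": "scene0000_00",    # assumed ScanNet-style scene identifier
    "level": "space",              # one of: "part", "object", "area", "space"
    "expression": (                # hypothetical referring expression
        "the empty floor space between the bed and the desk, "
        "large enough to place a chair"
    ),
    "bbox": [1.2, 0.5, 0.4,        # assumed box center (x, y, z), in meters
             0.6, 0.6, 0.8],       # assumed box size (dx, dy, dz), in meters
}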

Annotation Tool Demo

In the Annotation UI, you can explore up to 40 data items by clicking the Load Current ID Annotations button at the bottom of the page. For a detailed guide, use the Tutorial button in the top-right corner.

Examples from the Anywhere3D Benchmark

Here we present a few examples from the Anywhere3D dataset via a data explorer.

To use the data explorer, first select a scene from the selection bar. The corresponding visual grounding examples will be shown below. Click on a referring expression to visualize its ground-truth bounding box in the scene. Best viewed on a monitor.
Controls: Click + Drag = Rotate; Ctrl + Drag = Translate; Scroll Up/Down = Zoom In/Out


Quantitative Results on Anywhere3D-Bench

Results are reported as Acc@0.25IoU on Anywhere3D-Bench.
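
For reference, Acc@0.25IoU counts a prediction as correct when the 3D IoU between the predicted and ground-truth bounding boxes is at least 0.25. Below is a minimal sketch of this metric, assuming axis-aligned boxes parameterized as center plus size; the official evaluation script may use a different box parameterization.

import numpy as np

def aabb_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each given as (cx, cy, cz, dx, dy, dz)."""
    a_ctr, a_size = np.array(box_a[:3]), np.array(box_a[3:])
    b_ctr, b_size = np.array(box_b[:3]), np.array(box_b[3:])
    # Per-axis overlap, clipped at zero when the boxes do not intersect.
    overlap = np.clip(
        np.minimum(a_ctr + a_size / 2, b_ctr + b_size / 2)
        - np.maximum(a_ctr - a_size / 2, b_ctr - b_size / 2),
        0.0, None,
    )
    inter = overlap.prod()
    union = a_size.prod() + b_size.prod() - inter
    return inter / max(union, 1e-8)

def acc_at_iou(pred_boxes, gt_boxes, threshold=0.25):
    """Fraction of predictions whose IoU with the ground-truth box reaches the threshold."""
    hits = [aabb_iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / max(len(hits), 1)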

Gemini-2.5-Pro's Error Breakdown on Anywhere3D-Bench

Qualitative Results of Gemini-2.5-Pro on Anywhere3D-Bench

Here we present a few qualitative results from Anywhere3D-Bench together with Gemini-2.5-Pro's reasoning processes.
Green bounding boxes represent the ground truth, while red boxes represent Gemini-2.5-Pro's predictions.
Errors in Gemini-2.5-Pro's reasoning process are highlighted in bold.

Non-thinking Model vs. Thinking Model on Anywhere3D-Bench

Here we present a comparison between the best-performing non-thinking model (GPT-4.1) and the best-performing thinking model (Gemini-2.5-Pro) on Anywhere3D-Bench.
Green bounding boxes represent the ground truth, while red boxes represent each model's prediction.
Errors in Gemini-2.5-Pro's reasoning process are highlighted in bold.

Acknowledgements

We would especially like to thank ScanRefer for providing an excellent 3D annotation interface, which greatly facilitated the annotation process. We also appreciate the modifications SQA3D made to the ScanRefer annotation interface. The annotation interface used in Anywhere3D was adapted from their well-designed interfaces, and we are deeply grateful for their wonderful design and generous sharing with the community.

We would also like to thank the following open-source projects:

3D Visual Grounding Models: 3D-VisTA, PQ3D, Chat-Scene

We also wish to thank the numerous inspiring works on 3D visual grounding and spatial intelligence that have informed and motivated our research, though it is difficult to list all of them here.

BibTeX

@misc{anywhere3d,
      title={From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes}, 
      author={Tianxu Wang and Zhuofan Zhang and Ziyu Zhu and Yue Fan and Jing Xiong and Pengxiang Li and Xiaojian Ma and Qing Li},
      year={2025},
      eprint={2506.04897},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.04897}, 
}