Yuxin Wang1,2, Lei Ke2, Boqiang Zhang2, Tianyuan Qu2,3, Hanxun Yu2,4, Zhenpeng Huang2,5, Meng Yu2, Dan Xu1✉️, Dong Yu2
1HKUST, 2Tencent AI Lab, 3CUHK, 4ZJU, 5NJU
[Teaser video: Teaser.mp4]
N3D-VLM is a unified vision-language model for native 3D grounding and 3D spatial reasoning. By incorporating native 3D grounding, our model enables precise spatial reasoning, allowing users to query object relationships, distances, and attributes directly within complex 3D environments.
2025/12/19: We released this repo with the pre-trained model and inference code.
git clone --recursive https://github.com/W-Ted/N3D-VLM.git
cd N3D-VLM
conda create -n n3d_vlm python=3.11 -y
conda activate n3d_vlm
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
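After installing the dependencies, you can optionally verify that PyTorch sees your GPU. This is a minimal sanity check rather than part of the official setup, and it assumes an NVIDIA driver compatible with the CUDA 12.4 wheels above:

```python
# Optional environment sanity check (not part of the official setup).
# Assumes a driver compatible with the CUDA 12.4 PyTorch wheels installed above.
import torch

print("torch version:", torch.__version__)          # expected: 2.5.1
print("built with CUDA:", torch.version.cuda)        # expected: 12.4
print("CUDA available:", torch.cuda.is_available())
```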
We provide the pre-trained model here.
We provide three examples for inference with N3D-VLM. You can check the source files in the data directory, where the *.jpg files are the source images and the *.npz files are the monocular point clouds obtained with MoGe2.
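If you want to peek at a demo input pair before running the model, the sketch below loads one image and its point cloud with PIL and NumPy. The file names (data/demo1.jpg, data/demo1.npz) and the array keys are assumptions for illustration; check the actual files in the data directory.

```python
# Minimal sketch for inspecting a demo input pair.
# File names are illustrative; the key names inside the .npz depend on how
# the MoGe2 outputs were saved, so we simply list whatever arrays are stored.
import numpy as np
from PIL import Image

image = Image.open("data/demo1.jpg")   # hypothetical file name
points = np.load("data/demo1.npz")     # hypothetical file name

print("image size:", image.size)
for key in points.files:               # e.g. 3D points, validity masks, etc.
    print(key, points[key].shape, points[key].dtype)
```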
# inference
python demo.py
[Demo videos: rotate-22.mp4, Demo.2.mp4, Demo.3.mp4]
After running the command above, the inference results will be saved in the outputs directory, including the generated answers in *.json format and the 3D grounding results in *.rrd format.
The .rrd files can be visualized with Rerun:
rerun outputs/demo1.rrd
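For reference, the snippet below shows how a comparable .rrd recording could be written with the Rerun Python SDK (rerun-sdk). The entity paths, box values, and labels are made-up examples and do not reflect the exact logging schema used by demo.py.

```python
# Sketch of writing an .rrd file with the Rerun Python SDK (pip install rerun-sdk).
# Entity paths and values are illustrative, not the schema produced by demo.py.
import numpy as np
import rerun as rr

rr.init("n3d_vlm_example")        # start a recording
rr.save("outputs/example.rrd")    # stream everything logged below to an .rrd file

points = np.random.rand(500, 3)   # placeholder point cloud
rr.log("world/points", rr.Points3D(points))

rr.log(
    "world/detections/chair",
    rr.Boxes3D(centers=[[0.5, 0.4, 0.3]], half_sizes=[[0.2, 0.2, 0.3]], labels=["chair"]),
)
```

The resulting file can then be opened the same way, e.g. `rerun outputs/example.rrd`.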
If you want to run 3D detection only, please check the example below.
# inference
python detection.py
# visualization
rerun outputs/test1.rrd
@article{wang2025n3d,
title={N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models},
author={Wang, Yuxin and Ke, Lei and Zhang, Boqiang and Qu, Tianyuan and Yu, Hanxun and Huang, Zhenpeng and Yu, Meng and Xu, Dan and Yu, Dong},
journal={arXiv preprint arXiv:2512.16561},
year={2025}
}