Yuxin Wang1,2, Lei Ke2, Boqiang Zhang2, Tianyuan Qu2,3, Hanxun Yu2,4, Zhenpeng Huang2,5, Meng Yu2, Dan Xu1✉️, Dong Yu2
1HKUST, 2Tencent AI Lab, 3CUHK, 4ZJU, 5NJU
[Teaser video: Teaser.mp4]
N3D-VLM is a unified vision-language model for native 3D grounding and 3D spatial reasoning. By incorporating native 3D grounding, our model enables precise spatial reasoning, allowing users to query object relationships, distances, and attributes directly within complex 3D environments.
2025/12/19: We released this repo with the pre-trained model and inference code.
git clone --recursive https://github.com/W-Ted/N3D-VLM.git
cd N3D-VLM
conda create -n n3d_vlm python=3.11 -y
conda activate n3d_vlm
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
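After installing the dependencies, you can optionally verify that PyTorch sees your GPU. This is a minimal sanity check rather than part of the official setup, and it assumes an NVIDIA driver compatible with the CUDA 12.4 wheels above:

```python
# Optional environment sanity check (not part of the official setup).
# Assumes a driver compatible with the CUDA 12.4 PyTorch wheels installed above.
import torch

print("torch version:", torch.__version__)          # expected: 2.5.1
print("built with CUDA:", torch.version.cuda)        # expected: 12.4
print("CUDA available:", torch.cuda.is_available())
```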
We provide the pre-trained model here.
We provide three examples for inference with N3D-VLM. You can check the source files in the data directory, where the *.jpg files are the source images and the *.npz files are the monocular point clouds obtained with MoGe2.
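If you want to peek at a demo input pair before running the model, the sketch below loads one image and its point cloud with PIL and NumPy. The file names (data/demo1.jpg, data/demo1.npz) and the array keys are assumptions for illustration; check the actual files in the data directory.

```python
# Minimal sketch for inspecting a demo input pair.
# File names are illustrative; the key names inside the .npz depend on how
# the MoGe2 outputs were saved, so we simply list whatever arrays are stored.
import numpy as np
from PIL import Image

image = Image.open("data/demo1.jpg")   # hypothetical file name
points = np.load("data/demo1.npz")     # hypothetical file name

print("image size:", image.size)
for key in points.files:               # e.g. 3D points, validity masks, etc.
    print(key, points[key].shape, points[key].dtype)
```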
# inference
python demo.py
[Demo videos: rotate-22.mp4, Demo.2.mp4, Demo.3.mp4]
After running the command above, the inference results will be saved in the outputs directory, including the generated answers in *.json format and the 3D grounding results in *.rrd format.
The .rrd files can be visualized with Rerun:
rerun outputs/demo1.rrd
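For reference, the snippet below shows how a comparable .rrd recording could be written with the Rerun Python SDK (rerun-sdk). The entity paths, box values, and labels are made-up examples and do not reflect the exact logging schema used by demo.py.

```python
# Sketch of writing an .rrd file with the Rerun Python SDK (pip install rerun-sdk).
# Entity paths and values are illustrative, not the schema produced by demo.py.
import numpy as np
import rerun as rr

rr.init("n3d_vlm_example")        # start a recording
rr.save("outputs/example.rrd")    # stream everything logged below to an .rrd file

points = np.random.rand(500, 3)   # placeholder point cloud
rr.log("world/points", rr.Points3D(points))

rr.log(
    "world/detections/chair",
    rr.Boxes3D(centers=[[0.5, 0.4, 0.3]], half_sizes=[[0.2, 0.2, 0.3]], labels=["chair"]),
)
```

The resulting file can then be opened the same way, e.g. `rerun outputs/example.rrd`.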
If you want to run 3D detection only, please check the example below.
# inference
python detection.py
# visualization
rerun outputs/test1.rrd
@article{wang2025n3d,
title={N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models},
author={Wang, Yuxin and Ke, Lei and Zhang, Boqiang and Qu, Tianyuan and Yu, Hanxun and Huang, Zhenpeng and Yu, Meng and Xu, Dan and Yu, Dong},
journal={arXiv preprint arXiv:2512.16561},
year={2025}
}