
Official code for the paper: N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models


N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models

Yuxin Wang1,2, Lei Ke2, Boqiang Zhang2, Tianyuan Qu2,3, Hanxun Yu2,4,
Zhenpeng Huang2,5, Meng Yu2, Dan Xu1✉️, Dong Yu2
1HKUST, 2Tencent AI Lab, 3CUHK, 4ZJU, 5NJU

Teaser.mp4

Overview

N3D-VLM is a unified vision-language model for native 3D grounding and 3D spatial reasoning. By incorporating native 3D grounding, our model enables precise spatial reasoning, allowing users to query object relationships, distances, and attributes directly within complex 3D environments.

Updates

  • 2025/12/19: We released this repo with the pre-trained model and inference code.

Installation

git clone --recursive https://github.com/W-Ted/N3D-VLM.git
cd N3D-VLM

conda create -n n3d_vlm python=3.11 -y
conda activate n3d_vlm
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

Pre-trained model

We provide the pre-trained model here.

Inference

We provide three inference examples for N3D-VLM. You can find the source files in the data directory, where the *.jpg files are the source images and the *.npz files are the monocular point clouds obtained with MoGe2.
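If you want to inspect the point-cloud files before running inference, a minimal sketch using NumPy is shown below; it makes no assumption about the array names stored in the .npz files and simply lists whatever is there (the example file path is hypothetical):

# inspect_npz.py -- list the arrays stored in a monocular point-cloud file
import sys
import numpy as np

path = sys.argv[1] if len(sys.argv) > 1 else "data/demo1.npz"  # hypothetical file name
with np.load(path) as f:
    for key in f.files:
        arr = f[key]
        print(f"{key}: shape={arr.shape}, dtype={arr.dtype}")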

# inference 
python demo.py

Demo 1

rotate-22.mp4

Demo 2

Demo.2.mp4

Demo 3

Demo.3.mp4

After running the command above, the inference results are saved in the outputs directory, including generated answers in *.json format and 3D grounding results in *.rrd format. The .rrd files can be visualized with Rerun (the rerun CLI is provided by the rerun-sdk Python package):

rerun outputs/demo1.rrd
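To read the generated answers programmatically, a minimal sketch is shown below; the exact JSON file names and schema are not assumed, so the snippet just loads and prints every JSON file in the outputs directory:

# print_answers.py -- load and print the generated answers (JSON schema not assumed)
import glob
import json

for path in sorted(glob.glob("outputs/*.json")):
    with open(path) as f:
        result = json.load(f)
    print(path)
    print(json.dumps(result, indent=2))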

If you want to run 3D detection only, see the example below.

# inference 
python detection.py
# visualization
rerun outputs/test1.rrd
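For reference, .rrd files like the ones above can also be produced or post-processed with the Rerun Python SDK. The sketch below logs a single hypothetical 3D box and saves it to an .rrd file; it is not the repo's actual logging code, and the entity path, box geometry, and label are made-up values:

# log_box.py -- minimal Rerun SDK sketch: save one 3D box to an .rrd file
import rerun as rr

rr.init("n3d_vlm_demo")             # application id (arbitrary)
rr.save("outputs/example_box.rrd")  # stream everything logged below to this file
rr.log(
    "world/detections/box_0",       # hypothetical entity path
    rr.Boxes3D(
        centers=[[0.0, 0.0, 1.5]],      # hypothetical box center (meters)
        half_sizes=[[0.2, 0.3, 0.4]],   # hypothetical half extents
        labels=["chair"],               # hypothetical class label
    ),
)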

Citation

@article{wang2025n3d,
    title={N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models},
    author={Wang, Yuxin and Ke, Lei and Zhang, Boqiang and Qu, Tianyuan and Yu, Hanxun and Huang, Zhenpeng and Yu, Meng and Xu, Dan and Yu, Dong},
    journal={arXiv preprint arXiv:2512.16561},
    year={2025}
}
