BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion

arXiv | Paper PDF | Project Page

Yuqing Lan, Chenyang Zhu, Zhirui Gao, Jiazhao Zhang, Yihan Cao, Renjiao Yi, Yijie Wang, Kai Xu

This repository includes the public implementation of BoxFusion.

📢 News

  • 2025-12-23: Evaluation code is released.
  • 2025-10-31: Guidelines for the ROS2 demo (to be organized).
  • 2025-08-30: Code is released.
  • 2025-08-10: BoxFusion is accepted by Pacific Graphics 2025 (Journal Track) as a top 5% paper.

📋 TODO

  • Release the code and demos.
  • Release the online ROS demo for detecting nearby objects while the user/agent is scanning.
  • Release the evaluation code.

1. Installation

Please create a virtual environment with Python 3.10 and a recent 2.x build of PyTorch. The code has been tested on Ubuntu 22.04 with CUDA 11.8. The environment can be created like this:

conda create -n boxfusion python=3.10 
conda activate boxfusion

Install PyTorch:

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118

Then you can install the dependencies:

pip install -r requirements.txt
pip install -e .

2. Quick Start

1. Download the pre-trained RGB-D model Cubify Anything; please follow the license of Cubify Anything. Similarly, download the CLIP model. After downloading these models, move them into the models folder.

models
  |-- cutr_rgbd.pth
  |-- open_clip_pytorch_model.bin
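Before running the demo, it can save a failed run to verify that both weight files are in place. A minimal sketch (the filenames come from the layout above; the check itself is not part of the repository):

```python
# Sanity-check that the pre-trained weights listed above are present.
from pathlib import Path

REQUIRED = ["cutr_rgbd.pth", "open_clip_pytorch_model.bin"]

def check_models(models_dir: str) -> list[str]:
    """Return the required model files missing from models_dir."""
    root = Path(models_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]

if __name__ == "__main__":
    missing = check_models("./models")
    if missing:
        print(f"Missing model files: {missing} -- please download them first.")
    else:
        print("All model files found.")
```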

2. Download the example data from Google Drive, move it into the data folder, and unzip it.

3. Run demo.py on the example data for a quick start. The demo loads the data automatically, and the visualization shows the sequential RGB frames, depth maps, and 3D object boxes along with the camera trajectory. You can change the configuration to point to customized data. Note that online visualization slightly lowers the system FPS; set rerun=False in the config file for acceleration.

python demo.py CA1M --model-path ./models/cutr_rgbd.pth  --config ./config/ca1m.yaml --device cuda --seq 42898867

3. Data Preparation

We organize the data the way most SLAM methods do. The benchmark uses two datasets: CA-1M and ScanNetV2. If you want to test all sequences on these two datasets, please follow the steps in this section.

CA-1M

1. Following Cubify Anything, please download the data using the links in data/val.txt. You can use wget, curl, or any other tool. For evaluation, all sequences in data/val.txt are required.

2. Prepare the data structure according to the README.

The data structure is like this:

Structure of the CA-1M dataset:
CA-1M/
├── 48458654/                
│   ├── depth/               # Folder containing depth images
│   ├── rgb/                 # Folder containing RGB color images
│   ├── after_filter_boxes.npy  # Filtered gt 3D bounding boxes 
│   ├── all_poses.npy        # Camera poses for a sequence of frames [N,4,4]
│   ├── instances.json       # Instance segmentation or object detection results
│   ├── K_depth.txt          # Intrinsic camera matrix for the depth sensor
│   ├── K_rgb.txt            # Intrinsic camera matrix for the RGB sensor
│   ├── mesh.ply             # Reconstructed 3D mesh
│   └── T_gravity.npy        # Transformation matrix for gravity alignment
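The per-sequence files above can be loaded with plain NumPy. A hedged sketch, assuming the shapes noted in the tree ([N,4,4] poses, intrinsics stored as whitespace-separated text); the actual loaders in the repository may differ:

```python
# Load the per-sequence CA-1M files described in the tree above.
from pathlib import Path

import numpy as np

def load_sequence(seq_dir: str):
    """Return (poses, K_depth, K_rgb, T_gravity) for one sequence."""
    seq = Path(seq_dir)
    poses = np.load(seq / "all_poses.npy")      # (N, 4, 4) camera poses
    K_depth = np.loadtxt(seq / "K_depth.txt")   # depth-sensor intrinsics
    K_rgb = np.loadtxt(seq / "K_rgb.txt")       # RGB-sensor intrinsics
    T_gravity = np.load(seq / "T_gravity.npy")  # gravity-alignment transform
    assert poses.ndim == 3 and poses.shape[1:] == (4, 4)
    return poses, K_depth, K_rgb, T_gravity
```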

ScanNetV2

Please follow the ScanNetV2 instructions to download the validation sets. We use the default data structure:

Structure of the ScanNetV2 dataset:
ScanNet/
├── scene0xxx_0x/
│   ├── frames/
│   │   ├── color/
│   │   │   ├── 0.jpg
│   │   │   ├── 1.jpg
│   │   │   └── ...
│   │   ├── depth/
│   │   │   ├── 0.png
│   │   │   ├── 1.png
│   │   │   └── ...
│   │   ├── pose/
│   │   │   ├── 0.txt
│   │   │   ├── 1.txt
│   │   │   └── ...
│   │   ├── intrinsic/
│   │   │   └── intrinsic_depth.txt
│   └── scene0xxx_0x.txt
│   └── scene0xxx_0x_vh_clean_2.ply
└── ...
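In this layout, color, depth, and pose files are matched by their integer frame index. A minimal, illustrative iterator over one scene (not the repository's own dataloader):

```python
# Iterate the ScanNet frame layout shown above, pairing color/depth/pose
# files that share the same integer index.
from pathlib import Path

def iter_frames(scene_dir: str):
    """Yield (index, color_path, depth_path, pose_path) in numeric order."""
    frames = Path(scene_dir) / "frames"
    for color in sorted((frames / "color").glob("*.jpg"),
                        key=lambda p: int(p.stem)):
        idx = color.stem
        depth = frames / "depth" / f"{idx}.png"
        pose = frames / "pose" / f"{idx}.txt"
        if depth.is_file() and pose.is_file():
            yield idx, color, depth, pose
```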

4. Run

In this section, we describe how to run a given sequence from CA-1M or ScanNetV2. After preparing the datasets as described above, you can run the following commands to try BoxFusion on a specific sequence.

CA-1M

Please change datadir in config/ca1m.yaml to the root of your processed CA-1M dataset. Set --seq to the sequence you want to try.

python demo.py CA1M --model-path ./models/cutr_rgbd.pth  --config ./config/ca1m.yaml --device cuda --seq 42898867

ScanNetV2

Please change datadir in config/scannet.yaml to the root of your processed ScanNetV2 dataset. Set --seq to the sequence you want to try.

python demo.py scannet --model-path ./models/cutr_rgbd.pth  --config ./config/scannet.yaml --device cuda --seq scene0169_00

Others

We recommend preparing your data in the ScanNetV2 layout. Once the data is ready, you can instantiate a dataset object in this file and use a similar command to run on your own data.

5. ROS2 demo guideline

We provide basic code for the ROS2 version of BoxFusion. If you use an online data source, the dataloader automatically listens to the topics /rgb/image_raw for RGB, /depth/image_raw for depth, and /trajectory for camera poses. Any method that produces posed RGB-D data from raw RGB-D images will work (we recommend the sparse method ORB-SLAM3).
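The dataloader must associate each incoming RGB-D frame with a pose from /trajectory. The nearest-timestamp pairing this requires can be sketched without any ROS2 dependency; this is an illustrative buffer, and the actual dataloader's synchronization may differ:

```python
# Illustrative nearest-timestamp pairing of poses with RGB-D frames.
from bisect import bisect_left

class PoseBuffer:
    """Buffer poses by timestamp and look up the closest one to a frame."""

    def __init__(self):
        self.stamps, self.poses = [], []

    def add(self, stamp: float, pose) -> None:
        # Assumes poses arrive in increasing timestamp order.
        self.stamps.append(stamp)
        self.poses.append(pose)

    def nearest(self, stamp: float):
        """Return the buffered pose whose timestamp is closest to stamp."""
        if not self.stamps:
            return None
        i = bisect_left(self.stamps, stamp)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self.stamps)]
        best = min(candidates, key=lambda j: abs(self.stamps[j] - stamp))
        return self.poses[best]
```

In practice a real node would also drop frames whose nearest pose is farther than some time tolerance, to avoid pairing across tracking gaps.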

After starting the node for online pose estimation, run the following command to visualize BoxFusion's online detection.

python demo.py online --model-path ./models/cutr_rgbd.pth  --config ./config/online.yaml --device cuda 

6. Evaluation

Preparation

1. Please prepare the full datasets, i.e., 107 sequences for CA-1M and 100 sequences for ScanNetV2, as described above. The important ground-truth box files are after_filter_boxes.npy for CA-1M and the scannet_train_detection_data directory for ScanNetV2.

2. Please follow the instructions to prepare CA-1M. We also provide the processed CA-1M validation sets, in a format similar to ScanNet, on Hugging Face. Downloading the preprocessed sequences is recommended.

3. For ScanNetV2, follow the instructions in ./evaluation/data_util/README.md, or download the preprocessed GT data from Google Drive. Please move the scannet_train_detection_data directory to ./evaluation/data_util/.

Run

Please use the commands in Section 4 to run the full set of sequences. In the paper, we run all 107 validation scenes in CA-1M and uniformly select 100 scenes (see ./evaluation/data_util/meta_data/scannetv2_val.txt) in ScanNetV2. Please specify --data_path as your processed dataset root. Use the following commands as examples to run the evaluation.

cd evaluation

CA-1M

python eval_ca1m.py --dataset ca1m --data_path /media/lyq/temp/dataset/CA-1M-slam/ --dump_dir eval_ca1m --cluster_sampling seed_fps --use_3d_nms --use_cls_nms --per_class_proposal --gpu 0 --pred_root /home/lyq/myprojects/boxfusion/results/full/

ScanNetV2

python eval_scannet.py --dataset scannet --data_path /media/lyq/mydata/Dataset/ScanNet/ --dump_dir eval_scannet --num_point 40000 --cluster_sampling seed_fps --use_3d_nms --use_cls_nms --per_class_proposal --gpu 0 --pred_root /home/lyq/myprojects/boxfusion/results/scannet/
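Evaluation consumes predictions for every validation scene, so each scene must first be run through the Section 4 demo command. A hedged Python sketch of batch-running the ScanNetV2 scenes (the scene-list path is taken from above; the CLI flags mirror Section 4, but wrapping it this way is our illustration, not a script shipped with the repository):

```python
# Batch-run the Section 4 demo command over the ScanNetV2 validation list.
import subprocess

def build_cmd(seq: str) -> list[str]:
    """Assemble the demo.py invocation for one ScanNetV2 scene."""
    return ["python", "demo.py", "scannet",
            "--model-path", "./models/cutr_rgbd.pth",
            "--config", "./config/scannet.yaml",
            "--device", "cuda",
            "--seq", seq]

def run_all(scene_list: str = "./evaluation/data_util/meta_data/scannetv2_val.txt") -> None:
    with open(scene_list) as f:
        scenes = [line.strip() for line in f if line.strip()]
    for seq in scenes:
        subprocess.run(build_cmd(seq), check=True)
```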

Discussion for other methods

  1. For the methods compared in this paper (e.g., FCAF3D and SpatialLM), you need to rotate the input point cloud to be axis-aligned. Please refer to our co-author's repository.
  2. For online methods (e.g., EmbodiedSAM and OnlineAnySeg), you need to convert the segmented point clouds to OBBs and perform the evaluation in the same way. The reference code is here.
  3. As the paper was submitted in early June, we report results for version 1.0 of SpatialLM. Version 1.1 performs better, and we welcome contributions of updated results.
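The point-cloud-to-OBB conversion mentioned in item 2 can be done with a simple PCA fit. A hedged NumPy sketch (an illustration of the general technique, not the repository's reference code):

```python
# Fit an oriented bounding box to a segmented point cloud via PCA.
import numpy as np

def points_to_obb(points: np.ndarray):
    """points: (N, 3) array. Returns (center, rotation (3,3), extents (3,))."""
    center = points.mean(axis=0)
    centered = points - center
    # Principal axes from the SVD of the centered points (rows of vt).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    local = centered @ vt.T                      # points in the box frame
    mins, maxs = local.min(axis=0), local.max(axis=0)
    extents = maxs - mins
    # Shift the center to the midpoint of the extents in the box frame.
    center = center + ((mins + maxs) / 2) @ vt
    return center, vt.T, extents
```

Note that PCA axes are only a heuristic; for evaluation against gravity-aligned ground truth, it is common to constrain one axis to the gravity direction first.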

Thanks to our co-author; please consider starring their GitHub repository if you find it useful.

Acknowledgement

Parts of the code are modified from Cubify Anything. Thanks to the authors, and please consider citing their papers.

Citation

If you find our work useful in your research, please consider giving a star ✨ and citing the following paper:

@inproceedings{lan2025boxfusion,
  title={BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion},
  author={Lan, Yuqing and Zhu, Chenyang and Gao, Zhirui and Zhang, Jiazhao and Cao, Yihan and Yi, Renjiao and Wang, Yijie and Xu, Kai},
  booktitle={Computer Graphics Forum},
  volume={44},
  number={7},
  pages={e70254},
  year={2025},
  organization={Wiley Online Library}
}
