Abstract

SuperCluster

We introduce a highly efficient method for panoptic segmentation of large 3D point clouds by redefining this task as a scalable graph clustering problem. This approach can be trained using only local auxiliary tasks, thereby eliminating the resource-intensive instance-matching step during training. Moreover, our formulation can easily be adapted to the superpoint paradigm 🧩, further increasing its efficiency. This allows our model to process scenes with millions of points and thousands of objects in a single inference on one GPU ⚡. Our method, called SuperCluster, achieves a new state-of-the-art panoptic segmentation performance for two indoor scanning datasets: 50.1 PQ (+7.8) for S3DIS Area 5, and 58.7 PQ (+25.2) for ScanNetV2. We also set the first state-of-the-art for two large-scale mobile mapping benchmarks: KITTI-360 and DALES. With only 209k parameters 🦋, our model is over 30 times smaller than the best-competing method and trains up to 15 times faster ⚡. Our code and pretrained models are available at github.com/drprojects/superpoint_transformer

Motivation

Existing panoptic segmentation methods do not scale to large 3D scenes due to several limitations:

โš™๏ธ Costly matching operation at each training step
๐Ÿ”’ Fixed number of predictions
๐ŸŽญ Each prediction mask has the size of the scene
๐Ÿ˜ Large backbone

This project proposes a scalable approach for addressing 3D panoptic segmentation. To this end, we formulate panoptic segmentation as the solution of a superpoint graph clustering problem.

Figure: panoptic segmentation and superpoint partition.

Panoptic segmentation

Consider the superpoint partition above and the desired panoptic segmentation. Instead of learning to classify each individual point and assign it to an instance, we propose to learn to group superpoints together.

Intuitively, we want to group adjacent superpoints together if they are spatially close, have the same class, and are not separated by a border. We translate these goals into the following (superpoint) graph optimization problem.


Figure: SuperCluster pipeline and graph optimization formulation.
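In simplified, schematic form, such a graph clustering objective can be written as follows; the notation is illustrative and not the paper's exact equation:

```latex
% Illustrative superpoint graph clustering energy (simplified sketch).
% V: superpoints, E: adjacency edges, p_i: predicted class distribution
% of superpoint i, a_ij: predicted affinity that i and j belong to the
% same object, \ell: instance labeling, c_{\ell(i)}: class of cluster \ell(i).
\min_{\ell} \;
  \underbrace{\sum_{i \in V} d\!\left(p_i,\, c_{\ell(i)}\right)}_{\text{class consistency within clusters}}
  \;+\;
  \underbrace{\sum_{(i,j) \in E} a_{ij}\, \mathbf{1}\!\left[\ell(i) \neq \ell(j)\right]}_{\text{penalty for cutting high-affinity edges}}
```

The first term encourages each cluster to carry a single, consistent semantic class, while the second term makes it costly to separate adjacent superpoints that the model believes belong to the same object.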

Our idea is to train a model to predict the input parameters for this optimization problem, without explicitly asking the model to solve the panoptic segmentation task. If the model does its job, we should only need to solve the graph clustering problem at inference time, circumventing several limitations of existing panoptic segmentation methods.

Building on our previous Superpoint Transformer work, we already have the building blocks for constructing a graph of adjacent superpoints and training a model to classify them. In this work, we add a new head to Superpoint Transformer that learns to predict an affinity for each edge between two adjacent superpoints, indicating whether they belong to the same instance.

Interestingly, this SuperCluster model is trained only with local per-node and per-edge objectives. As previously mentioned, we do not need to explicitly compute the panoptic segmentation at training time, which bypasses the need for a matching step between predicted and target instances when computing losses and metrics. At inference time, we use a fast algorithm that finds an approximate solution to the (small) graph optimization problem, yielding the final panoptic segmentation prediction.
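The inference step can be pictured as a greedy merge over the superpoint graph. The sketch below is a hypothetical illustration of this idea using a union-find structure, not the actual solver used in the paper; the function name, arguments, and threshold are illustrative:

```python
def cluster_superpoints(classes, edges, affinities, threshold=0.5):
    """Greedily merge adjacent superpoints into instances (illustrative sketch).

    classes:    predicted semantic label per superpoint
    edges:      list of (i, j) pairs of adjacent superpoints
    affinities: predicted probability that i and j belong to the same object
    Returns one instance id per superpoint.
    """
    parent = list(range(len(classes)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Visit the most confident edges first; merge superpoints that are
    # likely the same object and share the same predicted class.
    for e in sorted(range(len(edges)), key=lambda e: -affinities[e]):
        i, j = edges[e]
        ri, rj = find(i), find(j)
        if ri != rj and affinities[e] >= threshold and classes[i] == classes[j]:
            parent[rj] = ri

    return [find(i) for i in range(len(classes))]
```

For example, with three superpoints where the first two share a class and a high-affinity edge, the first two are merged into one instance while the third keeps its own.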

Results

SuperCluster achieves state-of-the-art results for 3D panoptic segmentation on large-scale indoor datasets such as S3DIS and ScanNetV2, and sets a first state-of-the-art on large-scale outdoor datasets such as DALES and KITTI-360.

📊 SOTA on S3DIS 6-Fold (55.9 PQ)
📊 SOTA on S3DIS Area 5 (50.1 PQ)
📊 SOTA on ScanNet Val (58.7 PQ)
📊 FIRST on KITTI-360 Val (48.3 PQ)
📊 FIRST on DALES (61.2 PQ)
🦋 212k parameters (PointGroup ÷ 37)
⚡ S3DIS training in 4 GPU-hours
⚡ 7.8 km² tile of 18M points in 10.1 s on 1 GPU

SuperCluster is capable of processing 3D scenes of unprecedented scale at once on a single GPU.

S3DIS inference on one GPU

S3DIS

4.6 floors | 21.3M points | 4565/5298 predicted objects | 7.4 s inference | single 48G GPU



ScanNetV2 inference on one GPU

ScanNetV2

105 scans | 10.9M points | 2148/1683 predicted objects | 6.8 s inference | single 48G GPU



KITTI-360 inference on one GPU

KITTI-360

7.5 tiles | 11.0M points | 1947/602 predicted objects | 6.6 s inference | single 48G GPU



DALES inference on one GPU

DALES

7.8 km² | 18.0M points | 1559/1727 predicted objects | 10.1 s inference | single 48G GPU



Interactive visualizations