Abstract¶
We introduce a highly efficient method for panoptic segmentation of large 3D point clouds by redefining this task as a scalable graph clustering problem. This approach can be trained using only local auxiliary tasks, thereby eliminating the resource-intensive instance-matching step during training. Moreover, our formulation can easily be adapted to the superpoint paradigm, further increasing its efficiency. This allows our model to process scenes with millions of points and thousands of objects in a single inference on one GPU. Our method, called SuperCluster, achieves a new state-of-the-art panoptic segmentation performance for two indoor scanning datasets: 50.1 PQ (+7.8) for S3DIS Area 5, and 58.7 PQ (+25.2) for ScanNetV2. We also set the first state-of-the-art for two large-scale mobile mapping benchmarks: KITTI-360 and DALES. With only 209k parameters, our model is over 30 times smaller than the best competing method and trains up to 15 times faster. Our code and pretrained models are available at github.com/drprojects/superpoint_transformer.
Motivation¶
Existing panoptic segmentation methods do not scale to large 3D scenes due to several limitations:
Costly matching operation at each training step
Fixed number of predictions
Each prediction mask has the size of the scene
Large backbone
This project proposes a scalable approach for addressing 3D panoptic segmentation. To this end, we formulate panoptic segmentation as the solution of a superpoint graph clustering problem.
Panoptic segmentation
Consider the superpoint partition above and the desired panoptic segmentation. Instead of learning to classify and assign an instance to each individual point, we propose to learn to group superpoints together.
Intuitively, we want to group adjacent superpoints together if they are spatially close, have the same class, and are not separated by a border. We translate these goals into the following (superpoint) graph optimization problem.
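As a toy illustration of this trade-off (not the paper's exact objective; all names below are hypothetical), the cost of a candidate grouping can be thought of as paying for every high-affinity edge it cuts and for every pair of same-cluster superpoints whose classes disagree:

```python
# Toy sketch (hypothetical names, not the paper's actual energy):
# score a candidate grouping of superpoints. Cutting a high-affinity
# edge is penalized; so is grouping superpoints of different classes.

def partition_cost(classes, edges, partition, mix_penalty=10.0):
    """classes: {node: class id}; edges: {(u, v): affinity in [0, 1]};
    partition: {node: cluster id}. Lower cost = better grouping."""
    cost = 0.0
    # Pay the affinity of every edge the partition cuts.
    for (u, v), affinity in edges.items():
        if partition[u] != partition[v]:
            cost += affinity
    # Pay a fixed penalty for every same-cluster pair with mixed classes.
    nodes = list(classes)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if partition[u] == partition[v] and classes[u] != classes[v]:
                cost += mix_penalty
    return cost

classes = {0: "chair", 1: "chair", 2: "floor"}
edges = {(0, 1): 0.9, (1, 2): 0.1}
good = {0: 0, 1: 0, 2: 1}  # chairs together, floor apart
bad = {0: 0, 1: 1, 2: 1}   # cuts the strong chair-chair edge, mixes classes

print(partition_cost(classes, edges, good))  # 0.1
print(partition_cost(classes, edges, bad))   # 10.9
```

The grouping that keeps the two chair superpoints together and separates the floor is much cheaper, which is exactly the behavior the optimization encourages.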
Our idea is to train a model to predict the input parameters for this optimization problem, without explicitly asking the model to solve the panoptic segmentation task. If the model does its job, we should only need to solve the graph clustering problem at inference time, circumventing several limitations of existing panoptic segmentation methods.
Building on our previous Superpoint Transformer work, we already have the building blocks for constructing a graph of adjacent superpoints and training a model to classify them. In this work, we introduce a new head to Superpoint Transformer that learns to predict an affinity for each edge between two adjacent superpoints, indicating whether they belong to the same instance.
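To make the role of this head concrete, here is a minimal sketch, assuming an edge is described by the elementwise difference of its two superpoint embeddings. The real head is a learned neural module inside SuperCluster; the fixed weights and function name below are purely illustrative.

```python
import math

# Hypothetical sketch of an edge-affinity head: from the embeddings of
# two adjacent superpoints, predict the probability that they belong to
# the same instance. Weights are hand-fixed here for illustration only.

def edge_affinity(feat_u, feat_v, weights, bias):
    # Symmetric edge feature: elementwise |difference| of the embeddings.
    diff = [abs(a - b) for a, b in zip(feat_u, feat_v)]
    score = sum(w * x for w, x in zip(weights, diff)) + bias
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid -> affinity in (0, 1)

w, b = [-4.0, -4.0], 2.0
print(edge_affinity([0.9, 0.1], [0.85, 0.12], w, b))  # similar -> high
print(edge_affinity([0.9, 0.1], [0.1, 0.9], w, b))    # dissimilar -> low
```

Superpoints with similar embeddings get an affinity near 1 (likely same instance); dissimilar ones get an affinity near 0.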
Interestingly, this SuperCluster model is only trained with local per-node and per-edge objectives. As previously mentioned, we do not need to explicitly compute the panoptic segmentation at training time. This bypasses the need for a matching step between predicted and target instances for computing losses and metrics. At inference time, we use a fast algorithm that finds an approximate solution to the (small) graph optimization problem, yielding the final panoptic segmentation prediction.
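The inference step can be sketched as follows. This is a simple greedy union-find scheme that merges adjacent superpoints across their strongest edges first, whenever the affinity clears a threshold and the predicted classes agree; the actual SuperCluster solver differs, and all names here are illustrative.

```python
# Hedged sketch of the inference step: an approximate greedy solver for
# the superpoint graph clustering, using union-find. Not the actual
# SuperCluster algorithm; names and thresholds are illustrative.

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def cluster_superpoints(classes, edges, threshold=0.5):
    """classes: predicted class id per superpoint;
    edges: list of (u, v, affinity); returns a cluster id per superpoint."""
    parent = list(range(len(classes)))
    # Greedy: consider the strongest edges first.
    for u, v, affinity in sorted(edges, key=lambda e: -e[2]):
        if affinity < threshold:
            break
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv and classes[u] == classes[v]:
            parent[ru] = rv  # merge the two clusters
    return [find(parent, i) for i in range(len(classes))]

classes = ["chair", "chair", "chair", "floor"]
edges = [(0, 1, 0.95), (1, 2, 0.8), (2, 3, 0.9)]
print(cluster_superpoints(classes, edges))  # [2, 2, 2, 3]
```

The three chair superpoints collapse into one instance while the floor stays separate, despite the strong chair-floor edge, because the class check vetoes that merge. Crucially, this clustering runs only at inference, on a graph of superpoints rather than points, which is what keeps it cheap.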
Results¶
SuperCluster achieves state-of-the-art results for 3D panoptic segmentation on large-scale indoor datasets such as S3DIS and ScanNetV2, and sets a first state-of-the-art on large-scale outdoor datasets such as DALES and KITTI-360.
SOTA on S3DIS Area 5 (50.1 PQ)
SOTA on ScanNet Val (58.7 PQ)
FIRST on KITTI-360 Val (48.3 PQ)
FIRST on DALES (61.2 PQ)
212k parameters (PointGroup ÷ 37)
S3DIS training in 4 GPU-hours
7.8 km² tile of 18M points in 10.1 s on 1 GPU
SuperCluster is capable of processing 3D scenes of unprecedented scale at once on a single GPU.
S3DIS
4.6 floors | 21.3M points | 4565/5298 predicted objects | 7.4 s inference | single 48 GB GPU
ScanNetV2
105 scans | 10.9M points | 2148/1683 predicted objects | 6.8 s inference | single 48 GB GPU
KITTI-360
7.5 tiles | 11.0M points | 1947/602 predicted objects | 6.6 s inference | single 48 GB GPU
DALES
7.8 km² | 18.0M points | 1559/1727 predicted objects | 10.1 s inference | single 48 GB GPU