Abstract¶
We introduce a highly efficient method for panoptic segmentation of large 3D point clouds by redefining this task as a scalable graph clustering problem. This approach can be trained using only local auxiliary tasks, thereby eliminating the resource-intensive instance-matching step during training. Moreover, our formulation can easily be adapted to the superpoint paradigm, further increasing its efficiency. This allows our model to process scenes with millions of points and thousands of objects in a single inference on one GPU. Our method, called SuperCluster, achieves a new state-of-the-art panoptic segmentation performance for two indoor scanning datasets: 50.1 PQ (+7.8) for S3DIS Area 5, and 58.7 PQ (+25.2) for ScanNetV2. We also set the first state-of-the-art for two large-scale mobile mapping benchmarks: KITTI-360 and DALES. With only 209k parameters, our model is over 30 times smaller than the best competing method and trains up to 15 times faster. Our code and pretrained models are available at github.com/drprojects/superpoint_transformer.
Motivation¶
Existing panoptic segmentation methods do not scale to large 3D scenes due to several limitations:
Costly matching operation at each training step
Fixed number of predictions
Each prediction mask has the size of the scene
Large backbone
This project proposes a scalable approach for addressing 3D panoptic segmentation. To this end, we formulate panoptic segmentation as the solution of a superpoint graph clustering problem.
Panoptic segmentation
Consider the superpoint partition above and the desired panoptic segmentation. Instead of learning to classify and assign an instance to each individual point, we propose to learn to group superpoints together.
Intuitively, we want to group adjacent superpoints together if they are spatially close, have the same class, and are not separated by a border. We translate these goals into the following (superpoint) graph optimization problem.
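As a toy illustration of this trade-off (not the paper's exact objective; all names below are hypothetical), the cost of a candidate grouping can be thought of as paying for every high-affinity edge it cuts and for every pair of same-cluster superpoints whose classes disagree:

```python
# Toy sketch (hypothetical names, not the paper's actual energy):
# score a candidate grouping of superpoints. Cutting a high-affinity
# edge is penalized; so is grouping superpoints of different classes.

def partition_cost(classes, edges, partition, mix_penalty=10.0):
    """classes: {node: class id}; edges: {(u, v): affinity in [0, 1]};
    partition: {node: cluster id}. Lower cost = better grouping."""
    cost = 0.0
    # Pay the affinity of every edge the partition cuts.
    for (u, v), affinity in edges.items():
        if partition[u] != partition[v]:
            cost += affinity
    # Pay a fixed penalty for every same-cluster pair with mixed classes.
    nodes = list(classes)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if partition[u] == partition[v] and classes[u] != classes[v]:
                cost += mix_penalty
    return cost

classes = {0: "chair", 1: "chair", 2: "floor"}
edges = {(0, 1): 0.9, (1, 2): 0.1}
good = {0: 0, 1: 0, 2: 1}  # chairs together, floor apart
bad = {0: 0, 1: 1, 2: 1}   # cuts the strong chair-chair edge, mixes classes

print(partition_cost(classes, edges, good))  # 0.1
print(partition_cost(classes, edges, bad))   # 10.9
```

The grouping that keeps the two chair superpoints together and separates the floor is much cheaper, which is exactly the behavior the optimization encourages.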
Our idea is to train a model to predict the input parameters for this optimization problem, without explicitly asking the model to solve the panoptic segmentation task. If the model does its job, we should only need to solve the graph clustering problem at inference time, circumventing several limitations of existing panoptic segmentation methods.
Building on our previous Superpoint Transformer work, we already have the building blocks for constructing a graph of adjacent superpoints and training a model to classify them. In this work, we introduce a new head to Superpoint Transformer that learns to predict an affinity for each edge between two adjacent superpoints, indicating whether they belong to the same instance.
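To make the role of this head concrete, here is a minimal sketch, assuming an edge is described by the elementwise difference of its two superpoint embeddings. The real head is a learned neural module inside SuperCluster; the fixed weights and function name below are purely illustrative.

```python
import math

# Hypothetical sketch of an edge-affinity head: from the embeddings of
# two adjacent superpoints, predict the probability that they belong to
# the same instance. Weights are hand-fixed here for illustration only.

def edge_affinity(feat_u, feat_v, weights, bias):
    # Symmetric edge feature: elementwise |difference| of the embeddings.
    diff = [abs(a - b) for a, b in zip(feat_u, feat_v)]
    score = sum(w * x for w, x in zip(weights, diff)) + bias
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid -> affinity in (0, 1)

w, b = [-4.0, -4.0], 2.0
print(edge_affinity([0.9, 0.1], [0.85, 0.12], w, b))  # similar -> high
print(edge_affinity([0.9, 0.1], [0.1, 0.9], w, b))    # dissimilar -> low
```

Superpoints with similar embeddings get an affinity near 1 (likely same instance); dissimilar ones get an affinity near 0.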
Interestingly, this SuperCluster model is only trained with local per-node and per-edge objectives. As previously mentioned, we do not need to explicitly compute the panoptic segmentation at training time. This bypasses the need for a matching step between predicted and target instances for computing losses and metrics. At inference time, we use a fast algorithm that finds an approximate solution to the (small) graph optimization problem, yielding the final panoptic segmentation prediction.
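The inference step can be sketched as follows. This is a simple greedy union-find scheme that merges adjacent superpoints across their strongest edges first, whenever the affinity clears a threshold and the predicted classes agree; the actual SuperCluster solver differs, and all names here are illustrative.

```python
# Hedged sketch of the inference step: an approximate greedy solver for
# the superpoint graph clustering, using union-find. Not the actual
# SuperCluster algorithm; names and thresholds are illustrative.

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def cluster_superpoints(classes, edges, threshold=0.5):
    """classes: predicted class id per superpoint;
    edges: list of (u, v, affinity); returns a cluster id per superpoint."""
    parent = list(range(len(classes)))
    # Greedy: consider the strongest edges first.
    for u, v, affinity in sorted(edges, key=lambda e: -e[2]):
        if affinity < threshold:
            break
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv and classes[u] == classes[v]:
            parent[ru] = rv  # merge the two clusters
    return [find(parent, i) for i in range(len(classes))]

classes = ["chair", "chair", "chair", "floor"]
edges = [(0, 1, 0.95), (1, 2, 0.8), (2, 3, 0.9)]
print(cluster_superpoints(classes, edges))  # [2, 2, 2, 3]
```

The three chair superpoints collapse into one instance while the floor stays separate, despite the strong chair-floor edge, because the class check vetoes that merge. Crucially, this clustering runs only at inference, on a graph of superpoints rather than points, which is what keeps it cheap.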
Results¶
SuperCluster achieves state-of-the-art results for 3D panoptic segmentation on large-scale indoor datasets such as S3DIS and ScanNetV2, and sets a first state-of-the-art on large-scale outdoor datasets such as DALES and KITTI-360.
SOTA on S3DIS Area 5 (50.1 PQ)
SOTA on ScanNet Val (58.7 PQ)
FIRST on KITTI-360 Val (48.3 PQ)
FIRST on DALES (61.2 PQ)
212k parameters (PointGroup ÷ 37)
S3DIS training in 4 GPU-hours
7.8 km² tile of 18M points in 10.1 s on 1 GPU
SuperCluster is capable of processing 3D scenes of unprecedented scale at once on a single GPU.
S3DIS
4.6 floors | 21.3M points | 4565/5298 predicted objects | 7.4 s inference | single 48 GB GPU
ScanNetV2
105 scans | 10.9M points | 2148/1683 predicted objects | 6.8 s inference | single 48 GB GPU
KITTI-360
7.5 tiles | 11.0M points | 1947/602 predicted objects | 6.6 s inference | single 48 GB GPU
DALES
7.8 km² | 18.0M points | 1559/1727 predicted objects | 10.1 s inference | single 48 GB GPU