
Part-Aware Point Grounded Description. Given an input point cloud, the model is tasked with predicting a grounded description: text that provides a detailed interpretation of the 3D object, in which each part-level phrase (e.g., backrest and seat support) is linked to a point-wise segmentation mask. This challenges the model's capability for part-aware language understanding and segmentation grounding. (Note that the colors shown in this figure are not the actual colors of the point cloud; they indicate the different segmentation masks.)

Abstract

In this paper, we introduce Part-Aware Point Grounded Description (PaPGD), a challenging task aimed at advancing 3D multimodal learning for fine-grained, part-aware segmentation grounding and detailed explanation of 3D objects. Existing 3D datasets largely focus on either vision-only part segmentation or vision-language scene segmentation, lacking the fine-grained multimodal segmentation needed for robotic navigation and interaction in real-world environments. To address this gap, we present the 3DCoMPaT Grounded Instructions (3DCoMPaT-GrIn) Dataset, a comprehensive resource that pairs rich point cloud descriptions with corresponding part-level segmentation masks. The dataset provides extensive samples for both PaPGD and fine-grained single-part grounding tasks. To tackle the inherent challenges of grounding objects and generating grounded descriptions at the part level, we propose Kestrel, a part-aware 3D multimodal large language model that integrates an advanced language model for nuanced language comprehension with a multi-level point feature propagation and query refinement mechanism to enhance spatial reasoning at the part level. Extensive experiments demonstrate that Kestrel effectively bridges the gap between part-aware language understanding and 3D segmentation grounding, paving the way for more robust and interpretable 3D object comprehension that meets the demands of real-world robotic applications.

3DCoMPaT-GrIn

Here, we showcase several examples from the proposed 3DCoMPaT-GrIn dataset, featuring colored point clouds alongside their corresponding grounded descriptions. Each highlighted phrase in the text is linked to its corresponding location in the point cloud. As shown, the collected data effectively captures the diverse components of 3D objects, accurately representing each part-level element and its spatial position.
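
To make the data format concrete, the snippet below sketches how a single 3DCoMPaT-GrIn sample could be represented. All field names, shapes, and values are illustrative assumptions for this sketch, not the dataset's actual schema; the point is simply that each sample pairs a point cloud and a grounded description with one point-wise mask per part-level phrase.

    import numpy as np

    # Hypothetical layout of one 3DCoMPaT-GrIn sample (field names and shapes
    # are illustrative assumptions, not the released schema).
    sample = {
        # N points, each with XYZ coordinates and RGB color.
        "points": np.random.rand(2048, 6).astype(np.float32),
        # Grounded description of the whole object.
        "description": "A wooden chair with a curved backrest and a padded seat support.",
        # Each part-level phrase in the description is linked to a point-wise mask.
        "groundings": [
            {"phrase": "curved backrest", "mask": np.zeros(2048, dtype=bool)},
            {"phrase": "padded seat support", "mask": np.zeros(2048, dtype=bool)},
        ],
    }

    # A fine-grained single-part grounding sample reduces to one phrase/mask pair,
    # e.g. an instruction such as "segment the backrest" paired with its mask.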


Model

The Kestrel model combines a point encoder and an LLM into a 3D MLLM that generates detailed descriptions conditioned on the input point cloud and text. The 3D Segmentation Decoder extracts the output embedding of each [SEG] token from the 3D MLLM's output hidden states; after projection, the 3D SGM uses these [SEG] embeddings as initial queries q0. The point feature propagation module (PFPM) encodes multi-level point features fp. The segmentation decoder then takes q0 and fp as input and produces point-wise segmentation masks through a query refinement mechanism.
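
To make this flow easier to follow, here is a minimal PyTorch-style sketch of the segmentation side. All module names, shapes, and hyperparameters below are assumptions made for illustration; it mirrors the described flow ([SEG] embeddings projected into queries q0, multi-level point features fp from the PFPM, query refinement, mask prediction) rather than the released implementation.

    import torch
    import torch.nn as nn

    class SegmentationGroundingSketch(nn.Module):
        """Toy stand-in for the segmentation grounding side: refines
        [SEG]-token queries against multi-level point features and
        predicts point-wise mask logits."""

        def __init__(self, d_model=256, num_layers=3):
            super().__init__()
            layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            self.refine = nn.TransformerDecoder(layer, num_layers)  # query refinement
            self.mask_head = nn.Linear(d_model, d_model)

        def forward(self, q0, fp):
            # q0: (B, num_seg_tokens, d) initial queries from projected [SEG] embeddings
            # fp: (B, num_points, d)     multi-level point features from the PFPM
            q = self.refine(q0, fp)  # refined part-level queries
            # Similarity between refined queries and per-point features -> mask logits.
            return torch.einsum("bqd,bnd->bqn", self.mask_head(q), fp)

    # Toy usage: 4 [SEG] queries over a 2048-point cloud -> (1, 4, 2048) mask logits.
    sgm = SegmentationGroundingSketch()
    mask_logits = sgm(torch.randn(1, 4, 256), torch.randn(1, 2048, 256))

    # Hypothetical end-to-end flow (names are placeholders, not the released API):
    #   hidden = mllm(point_tokens, text_tokens)        # 3D MLLM hidden states
    #   q0     = proj(hidden[:, seg_positions])         # projected [SEG] embeddings
    #   fp     = pfpm(point_cloud)                      # multi-level point features
    #   masks  = sgm(q0, fp).sigmoid() > 0.5            # point-wise segmentation masks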


Experiments

We report the performance of Kestrel on grounded description generation, direct segmentation, and reasoning-based segmentation, both on our proposed 3DCoMPaT-GrIn dataset and on other public benchmarks. To further demonstrate the effectiveness and generalizability of our model, we also evaluate Kestrel in out-of-domain and real-world scenarios. Please refer to our paper for detailed quantitative results. Here, we provide several visualization examples to qualitatively showcase Kestrel's capabilities!
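
As a rough illustration of how part-level segmentation grounding is commonly scored, the snippet below computes point-wise IoU between a predicted and a ground-truth mask and averages it over grounded phrases. This is a generic sketch of a standard metric, not the exact evaluation protocol used in the paper.

    import numpy as np

    def point_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
        """Point-wise IoU between two boolean masks over the same point cloud."""
        intersection = np.logical_and(pred_mask, gt_mask).sum()
        union = np.logical_or(pred_mask, gt_mask).sum()
        return float(intersection) / float(union) if union > 0 else 1.0

    def mean_iou(pred_masks, gt_masks):
        """Average IoU over all grounded phrases of an object (lists of masks)."""
        return sum(point_iou(p, g) for p, g in zip(pred_masks, gt_masks)) / len(pred_masks)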

BibTeX


        @article{fei2024kestrel,
          title={Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding},
          author={Fei, Junjie and Ahmed, Mahmoud and Ding, Jian and Bakr, Eslam Mohamed and Elhoseiny, Mohamed},
          journal={arXiv preprint arXiv:2405.18937},
          year={2024}
        }