In this paper, we introduce Part-Aware Point Grounded Description (PaPGD), a challenging task aimed at advancing 3D multimodal learning for fine-grained, part-aware segmentation grounding and detailed explanation of 3D objects. Existing 3D datasets largely focus on either vision-only part segmentation or vision-language scene segmentation, lacking the fine-grained multimodal segmentation needed for robotic navigation and interaction in real-world environments. To address this gap, we present the 3DCoMPaT Grounded Instructions (3DCoMPaT-GrIn) Dataset, a comprehensive resource that pairs rich point cloud descriptions with corresponding part-level segmentation masks. The dataset contains extensive samples designed for both PaPGD and fine-grained single-part grounding tasks. To tackle the inherent challenges of grounding objects and generating grounded descriptions at the part level, we propose Kestrel, a part-aware 3D multimodal large language model that integrates an advanced language model for nuanced language comprehension with multi-level point feature propagation and a query refinement mechanism to enhance spatial reasoning at the part level. Extensive experiments demonstrate that Kestrel effectively bridges the gap between part-aware language understanding and 3D segmentation grounding, paving the way for more robust and interpretable 3D object comprehension that meets the demands of real-world robotic applications.
Here, we showcase several examples from the proposed 3DCoMPaT-GrIn dataset, featuring colored point clouds alongside their corresponding grounded descriptions. Each highlighted phrase in the text is linked to its corresponding location in the point cloud. As shown, the collected data effectively captures the diverse components of 3D objects, accurately representing each part-level element and its spatial position.
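For concreteness, the sketch below shows one way such a sample could be represented in code: a colored point cloud, its grounded description, and the phrase-to-mask links. The schema and field names (`points`, `description`, `grounded_phrases`) are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class GroundedPhrase:
    """A phrase in the description linked to a part-level mask (hypothetical schema)."""
    text: str              # e.g. "wooden armrest"
    span: Tuple[int, int]  # character offsets of the phrase in the description
    mask: np.ndarray       # boolean mask over the N points of the object

@dataclass
class GrInSample:
    """One illustrative 3DCoMPaT-GrIn-style sample: a colored point cloud,
    a grounded description, and the phrase-to-mask links."""
    points: np.ndarray                                   # (N, 6) array: xyz + rgb
    description: str                                     # full part-aware description
    grounded_phrases: List[GroundedPhrase] = field(default_factory=list)

# Toy example with random geometry, purely for illustration.
N = 2048
sample = GrInSample(
    points=np.random.rand(N, 6).astype(np.float32),
    description="A chair with a wooden armrest and a metal leg.",
)
armrest_mask = np.zeros(N, dtype=bool)
armrest_mask[:256] = True  # pretend the first 256 points form the armrest
start = sample.description.find("wooden armrest")
sample.grounded_phrases.append(
    GroundedPhrase(text="wooden armrest",
                   span=(start, start + len("wooden armrest")),
                   mask=armrest_mask)
)
```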
The Kestrel model incorporates a point encoder and an LLM to construct a 3D MLLM designed to generate detailed descriptions based on the input point cloud and text. The 3D Segmentation Decoder extracts the output embedding of the [SEG] token from the output hidden states of the 3D MLLM. After projection, these [SEG] embeddings serve as the initial queries of the 3D SGM. The point feature propagation module (PFPM) encodes multi-level point features. The segmentation decoder then takes the queries and the multi-level point features as input and generates point-wise segmentation masks via a query refinement mechanism.
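The sketch below mirrors this data flow in simplified PyTorch. The layer choices (a linear point encoder, a tiny transformer standing in for the LLM, single-level point features, and one cross-attention step as the query refinement) are assumptions made for illustration, not the actual Kestrel implementation.

```python
import torch
import torch.nn as nn

class KestrelSketch(nn.Module):
    """Minimal sketch of the described data flow; all module sizes and
    stand-in layers are simplifying assumptions."""

    def __init__(self, d_llm=512, d_seg=256):
        super().__init__()
        # Stand-ins for the point encoder and the LLM backbone.
        self.point_encoder = nn.Linear(6, d_llm)  # xyz + rgb -> token features
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_llm, nhead=8, batch_first=True), num_layers=2)
        # Projects the [SEG] hidden state into the segmentation decoder's query space.
        self.seg_proj = nn.Linear(d_llm, d_seg)
        # PFPM stand-in: produces per-point features at the decoder's width.
        self.pfpm = nn.Linear(6, d_seg)
        # Query refinement stand-in: cross-attention between queries and point features.
        self.query_refine = nn.MultiheadAttention(d_seg, num_heads=8, batch_first=True)

    def forward(self, points, text_embeds, seg_token_index):
        # points: (B, N, 6); text_embeds: (B, T, d_llm); seg_token_index: [SEG] position in text
        point_tokens = self.point_encoder(points)                   # (B, N, d_llm)
        hidden = self.llm(torch.cat([point_tokens, text_embeds], dim=1))
        seg_hidden = hidden[:, points.shape[1] + seg_token_index]   # [SEG] hidden state
        query = self.seg_proj(seg_hidden).unsqueeze(1)              # (B, 1, d_seg)
        point_feats = self.pfpm(points)                             # (B, N, d_seg)
        refined, _ = self.query_refine(query, point_feats, point_feats)
        # Dot-product of the refined query with per-point features -> point-wise mask logits.
        mask_logits = torch.einsum("bqd,bnd->bqn", refined, point_feats)
        return mask_logits                                          # (B, 1, N)

# Usage: one toy point cloud, 16 text tokens, [SEG] as the last text token.
logits = KestrelSketch()(torch.rand(1, 2048, 6), torch.rand(1, 16, 512), seg_token_index=15)
```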
We report the performance of Kestrel on grounded description generation, direct segmentation, and reasoning-based segmentation across our proposed 3DCoMPaT-GrIn dataset as well as other public benchmarks. To further demonstrate the effectiveness and generalizability of our model, we also evaluate Kestrel in out-of-domain and real-world scenarios. Please refer to our paper for detailed quantitative results. We provide several visualization examples here to qualitatively showcase Kestrel's capabilities!
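As a small companion to the quantitative results, the snippet below shows how point-wise mask IoU, a standard measure for segmentation grounding, can be computed between a predicted and a ground-truth part mask. This is a generic sketch; the paper's exact evaluation protocol and metrics may differ.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Point-wise IoU between two boolean masks over the same point cloud."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union

# Toy example: 2048 points, the predicted part overlaps most of the ground truth.
gt = np.zeros(2048, dtype=bool);   gt[:300] = True
pred = np.zeros(2048, dtype=bool); pred[50:320] = True
print(f"IoU = {mask_iou(pred, gt):.3f}")
```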
@article{fei2024kestrel,
title={Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding},
author={Fei, Junjie and Ahmed, Mahmoud and Ding, Jian and Bakr, Eslam Mohamed and Elhoseiny, Mohamed},
journal={arXiv preprint arXiv:2405.18937},
year={2024}
}