In this paper, we introduce Part-Aware Point Grounded Description (PaPGD), a challenging task aimed at advancing 3D multimodal learning for fine-grained, part-aware segmentation grounding and detailed explanation of 3D objects. Existing 3D datasets largely focus on either vision-only part segmentation or vision-language scene segmentation, lacking the fine-grained multimodal segmentation needed for robotic navigation and interaction in real-world environments. To address this gap, we present the 3DCoMPaT Grounded Instructions (3DCoMPaT-GrIn) Dataset, a comprehensive resource that pairs rich point cloud descriptions with corresponding part-level segmentation masks. The dataset contains extensive samples designed for both PaPGD and fine-grained single-part grounding tasks. To tackle the inherent challenges of grounding objects and generating grounded descriptions at the part level, we propose Kestrel, a part-aware 3D multimodal large language model that integrates an advanced language model for nuanced language comprehension with multi-level point feature propagation and a query refinement mechanism to enhance spatial reasoning at the part level. Extensive experiments demonstrate that Kestrel effectively bridges the gap between part-aware language understanding and 3D segmentation grounding, paving the way for more robust and interpretable 3D object comprehension that meets the demands of real-world robotic applications.
Here, we showcase several examples from the proposed 3DCoMPaT-GrIn dataset, featuring colored point clouds alongside their corresponding grounded descriptions. Each highlighted phrase in the text is linked to its corresponding location in the point cloud. As shown, the collected data effectively captures the diverse components of 3D objects, accurately representing each part-level element and its spatial position.
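For concreteness, the sketch below shows one way such a sample could be represented in code: a colored point cloud, its grounded description, and the phrase-to-mask links. The schema and field names (`points`, `description`, `grounded_phrases`) are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class GroundedPhrase:
    """A phrase in the description linked to a part-level mask (hypothetical schema)."""
    text: str              # e.g. "wooden armrest"
    span: Tuple[int, int]  # character offsets of the phrase in the description
    mask: np.ndarray       # boolean mask over the N points of the object

@dataclass
class GrInSample:
    """One illustrative 3DCoMPaT-GrIn-style sample: a colored point cloud,
    a grounded description, and the phrase-to-mask links."""
    points: np.ndarray                                   # (N, 6) array: xyz + rgb
    description: str                                     # full part-aware description
    grounded_phrases: List[GroundedPhrase] = field(default_factory=list)

# Toy example with random geometry, purely for illustration.
N = 2048
sample = GrInSample(
    points=np.random.rand(N, 6).astype(np.float32),
    description="A chair with a wooden armrest and a metal leg.",
)
armrest_mask = np.zeros(N, dtype=bool)
armrest_mask[:256] = True  # pretend the first 256 points form the armrest
start = sample.description.find("wooden armrest")
sample.grounded_phrases.append(
    GroundedPhrase(text="wooden armrest",
                   span=(start, start + len("wooden armrest")),
                   mask=armrest_mask)
)
```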
The Kestrel model incorporates a point encoder and an LLM to construct a 3D MLLM designed to generate detailed descriptions based on the input point cloud and text. The 3D Segmentation Decoder extracts the output embedding of the [SEG] token from the output hidden states of the 3D MLLM. After projection, these [SEG] embeddings serve as the initial queries of the 3D SGM. The point feature propagation module (PFPM) encodes multi-level point features. The segmentation decoder then takes the queries and the multi-level point features as input and generates point-wise segmentation masks via a query refinement mechanism.
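The sketch below mirrors this data flow in simplified PyTorch. The layer choices (a linear point encoder, a tiny transformer standing in for the LLM, single-level point features, and one cross-attention step as the query refinement) are assumptions made for illustration, not the actual Kestrel implementation.

```python
import torch
import torch.nn as nn

class KestrelSketch(nn.Module):
    """Minimal sketch of the described data flow; all module sizes and
    stand-in layers are simplifying assumptions."""

    def __init__(self, d_llm=512, d_seg=256):
        super().__init__()
        # Stand-ins for the point encoder and the LLM backbone.
        self.point_encoder = nn.Linear(6, d_llm)  # xyz + rgb -> token features
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_llm, nhead=8, batch_first=True), num_layers=2)
        # Projects the [SEG] hidden state into the segmentation decoder's query space.
        self.seg_proj = nn.Linear(d_llm, d_seg)
        # PFPM stand-in: produces per-point features at the decoder's width.
        self.pfpm = nn.Linear(6, d_seg)
        # Query refinement stand-in: cross-attention between queries and point features.
        self.query_refine = nn.MultiheadAttention(d_seg, num_heads=8, batch_first=True)

    def forward(self, points, text_embeds, seg_token_index):
        # points: (B, N, 6); text_embeds: (B, T, d_llm); seg_token_index: [SEG] position in text
        point_tokens = self.point_encoder(points)                   # (B, N, d_llm)
        hidden = self.llm(torch.cat([point_tokens, text_embeds], dim=1))
        seg_hidden = hidden[:, points.shape[1] + seg_token_index]   # [SEG] hidden state
        query = self.seg_proj(seg_hidden).unsqueeze(1)              # (B, 1, d_seg)
        point_feats = self.pfpm(points)                             # (B, N, d_seg)
        refined, _ = self.query_refine(query, point_feats, point_feats)
        # Dot-product of the refined query with per-point features -> point-wise mask logits.
        mask_logits = torch.einsum("bqd,bnd->bqn", refined, point_feats)
        return mask_logits                                          # (B, 1, N)

# Usage: one toy point cloud, 16 text tokens, [SEG] as the last text token.
logits = KestrelSketch()(torch.rand(1, 2048, 6), torch.rand(1, 16, 512), seg_token_index=15)
```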
We report the performance of Kestrel on grounded description generation, direct segmentation, and reasoning-based segmentation across our proposed 3DCoMPaT-GrIn dataset as well as other public benchmarks. To further demonstrate the effectiveness and generalizability of our model, we also evaluate Kestrel in out-of-domain and real-world scenarios. Please refer to our paper for detailed quantitative results. We provide several visualization examples here to qualitatively showcase Kestrel's capabilities!
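As a small companion to the quantitative results, the snippet below shows how point-wise mask IoU, a standard measure for segmentation grounding, can be computed between a predicted and a ground-truth part mask. This is a generic sketch; the paper's exact evaluation protocol and metrics may differ.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Point-wise IoU between two boolean masks over the same point cloud."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union

# Toy example: 2048 points, the predicted part overlaps most of the ground truth.
gt = np.zeros(2048, dtype=bool);   gt[:300] = True
pred = np.zeros(2048, dtype=bool); pred[50:320] = True
print(f"IoU = {mask_iou(pred, gt):.3f}")
```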
@article{fei2024kestrel,
title={Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding},
author={Fei, Junjie and Ahmed, Mahmoud and Ding, Jian and Bakr, Eslam Mohamed and Elhoseiny, Mohamed},
journal={arXiv preprint arXiv:2405.18937},
year={2024}
}