Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning
CoRR (2024)
Abstract
This paper introduces Scene-LLM, a 3D-visual-language model that enhances
embodied agents' abilities in interactive 3D indoor environments by integrating
the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a
hybrid 3D visual feature representation that incorporates dense spatial
information and supports scene state updates. The model employs a projection
layer to efficiently project these features into the pre-trained textual
embedding space, enabling effective interpretation of 3D visual information.
Unique to our approach is the integration of both scene-level and ego-centric
3D information. This combination is pivotal for interactive planning, where
scene-level data supports global planning and ego-centric data is important for
localization. Notably, we use ego-centric 3D frame features for feature
alignment, an efficient technique that enhances the model's ability to align
features of small objects within the scene. Our experiments with Scene-LLM
demonstrate its strong capabilities in dense captioning, question answering,
and interactive planning. We believe Scene-LLM advances the field of 3D visual
understanding and reasoning, offering new possibilities for sophisticated agent
interactions in indoor settings.
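To make the projection step concrete, the sketch below shows one plausible shape for mapping 3D visual features into an LLM's textual embedding space. All dimensions, names, and the use of a single linear map are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Hypothetical dimensions -- assumed for illustration, not from the paper.
D_3D = 512      # per-point 3D visual feature size (assumed)
D_EMB = 4096    # LLM token-embedding size (assumed)
N_TOKENS = 256  # number of visual tokens handed to the LLM (assumed)

rng = np.random.default_rng(0)

# A single linear projection layer; real weights would be learned during
# feature alignment, random values stand in here.
W = rng.standard_normal((D_3D, D_EMB)) * 0.02
b = np.zeros(D_EMB)

def project_scene_features(feats_3d: np.ndarray) -> np.ndarray:
    """Map 3D visual features into the pre-trained textual embedding space."""
    return feats_3d @ W + b

scene_feats = rng.standard_normal((N_TOKENS, D_3D))
visual_tokens = project_scene_features(scene_feats)
# visual_tokens now lives in the embedding space and could be concatenated
# with text token embeddings before being fed to the language model.
print(visual_tokens.shape)  # (256, 4096)
```

The same projection could serve both scene-level and ego-centric frame features, which is consistent with the abstract's point that both streams are interpreted in a shared textual embedding space.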