Can we integrate color and depth information for richer captions?

2023 5th Novel Intelligent and Leading Emerging Sciences Conference (NILES), 2023

Abstract
While image captioning models have successfully produced high-quality descriptions, most research has focused on generating a single sentence for 2D images. This study explores whether adding depth information to RGB images can improve captioning and yield better descriptions. We propose a Transformer-based encoder-decoder model that generates a multi-sentence description of a 3D scene. Our framework takes an RGB image and its corresponding depth map as input and combines them to build a more comprehensive understanding of the scene. We investigated various strategies for fusing the RGB and depth inputs and conducted experiments on the NYU-v2 dataset. While working with NYU-v2, however, we discovered inconsistent labeling that undermines the benefit of using depth information, and the resulting captions were considerably worse than those obtained from RGB images alone. We therefore propose a more accurate and consistent version of the NYU-v2 dataset. On this revised dataset, our results demonstrate that the proposed framework effectively benefits from depth information and produces better captions.
Keywords
3D scene, Image captioning, Depth fusion, Transformers
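The abstract describes fusing an RGB image with its depth map before a Transformer encoder-decoder. As an illustration only, the sketch below shows one plausible fusion strategy (early fusion by channel concatenation ahead of a Transformer encoder) in PyTorch; the class name, patch size, and dimensions are assumptions for demonstration, not the authors' implementation.

```python
# Minimal sketch of early RGB-depth fusion: concatenate the depth channel with
# the RGB channels, embed patches, and encode the scene with a Transformer.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Fuses an RGB image and its depth map by channel concatenation,
    then feeds patch embeddings to a Transformer encoder."""
    def __init__(self, d_model=256, patch=16, n_layers=4, n_heads=8):
        super().__init__()
        # 3 RGB channels + 1 depth channel -> patch embeddings
        self.patch_embed = nn.Conv2d(4, d_model, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, rgb, depth):
        # rgb: (B, 3, H, W), depth: (B, 1, H, W)
        x = torch.cat([rgb, depth], dim=1)                   # (B, 4, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, d_model)
        return self.encoder(x)   # scene memory for a caption decoder

# Example: encode a 224x224 RGB-D pair into a sequence of scene tokens.
enc = EarlyFusionEncoder()
memory = enc(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
print(memory.shape)  # torch.Size([1, 196, 256])
```

Alternative fusion strategies mentioned in the abstract (for example, encoding the two modalities separately and merging their features later) would replace the channel concatenation step with a later merge of the two token sequences.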