Language-Based Depth Hints for Monocular Depth Estimation
arXiv (2024)
Abstract
Monocular depth estimation (MDE) is inherently ambiguous, as a given image
may result from many different 3D scenes and vice versa. To resolve this
ambiguity, an MDE system must make assumptions about the most likely 3D scenes
for a given input. These assumptions can be either explicit or implicit. In
this work, we demonstrate the use of natural language as a source of an
explicit prior about the structure of the world. We assume that
human language encodes the likely distribution in depth-space of various
objects. We first show that a language model encodes this implicit bias during
training, and that it can be extracted using a very simple learned approach. We
then show that this prediction can be provided to an MDE system as an explicit
prior, using an off-the-shelf instance segmentation model
to supply the labels that serve as input to the language model. We
demonstrate the performance of our method on the NYUD2 dataset, showing
improvement compared to the baseline and to random controls.
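
The data flow described above (segmentation labels, then a per-label depth prediction, then a hint passed to the MDE system) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the abstract does not specify the form of the prior, so the sketch represents it as a hypothetical (mean, standard deviation) pair in metres, and a lookup table with placeholder numbers stands in for the language model and its learned head. The function `depth_hint_map` is likewise hypothetical and is not the authors' implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the language model plus its learned head:
# label -> (mean depth in metres, standard deviation in metres).
# The numbers are illustrative placeholders, not results from the paper.
LABEL_DEPTH_PRIOR: Dict[str, Tuple[float, float]] = {
    "chair": (2.0, 0.8),
    "wall": (4.0, 1.5),
}


def depth_hint_map(
    instance_ids: List[List[int]],
    id_to_label: Dict[int, str],
    prior: Dict[str, Tuple[float, float]],
    fill: Tuple[float, float] = (0.0, 0.0),
) -> List[List[Tuple[float, float]]]:
    """Paint every pixel with the (mean, std) prior of its instance label.

    Pixels whose instance id has no label, or whose label has no prior,
    receive the `fill` value.
    """
    hints = []
    for row in instance_ids:
        out_row = []
        for inst in row:
            label = id_to_label.get(inst)
            out_row.append(prior.get(label, fill))
        hints.append(out_row)
    return hints


# Toy 2x3 instance mask: 0 = background, 1 and 2 = detected instances.
mask = [[1, 1, 2],
        [0, 2, 2]]
labels = {1: "chair", 2: "wall"}
hints = depth_hint_map(mask, labels, LABEL_DEPTH_PRIOR)
```

In a setup like this, the resulting per-pixel hint map could be stacked with the RGB image as extra input channels to the depth network. That is one plausible way to supply an explicit prior, not necessarily the one used in the paper.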