Enhancing Embodied Object Detection through Language-Image Pre-training and Implicit Object Memory
CoRR (2024)
Abstract
Deep learning and large-scale language-image training have produced image
object detectors that generalise well to diverse environments and semantic
classes. However, single-image object detectors trained on internet data are
not optimally tailored for the embodied conditions inherent in robotics.
Instead, robots must detect objects from complex multi-modal data streams
involving depth, localisation and temporal correlation, a task termed embodied
object detection. Paradigms such as Video Object Detection (VOD) and Semantic
Mapping have been proposed to leverage such embodied data streams, but existing
work fails to enhance performance using language-image training. In response,
we investigate how an image object detector pre-trained using language-image
data can be extended to perform embodied object detection. We propose a novel
implicit object memory that uses projective geometry to aggregate the features
of detected objects across long temporal horizons. The spatial and temporal
information accumulated in memory is then used to enhance the image features of
the base detector. When tested on embodied data streams sampled from diverse
indoor scenes, our approach improves the base object detector by 3.09 mAP,
outperforming alternative external memories designed for VOD and Semantic
Mapping. Our method also shows a significant improvement of 16.90 mAP relative
to baselines that perform embodied object detection without first training on
language-image data, and is robust to sensor noise and domain shift experienced
in real-world deployment.
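
The abstract describes aggregating object features across time with projective geometry. The following is a minimal sketch of that general idea only, not the authors' implementation: the memory in the paper is implicit and learned, whereas this sketch swaps in a plain running average over a top-down grid. Every name here (`backproject`, `camera_to_world`, `GridMemory`), the pinhole intrinsics, the yaw-only camera pose, and the cell size are illustrative assumptions.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point, yaw, position):
    """Assumed pose model: rotate about the vertical (y) axis by yaw, then translate."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    px, py, pz = position
    return (c * x + s * z + px, y + py, -s * x + c * z + pz)


class GridMemory:
    """Hypothetical top-down memory: each ground-plane cell keeps a running mean feature."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = {}  # (i, j) -> (mean_feature, observation_count)

    def cell_index(self, world_point):
        # Height (y) is ignored; cells are indexed on the x-z ground plane.
        x, _, z = world_point
        return (math.floor(x / self.cell_size), math.floor(z / self.cell_size))

    def update(self, world_point, feature):
        key = self.cell_index(world_point)
        mean, count = self.cells.get(key, ([0.0] * len(feature), 0))
        count += 1
        # Incremental mean: a stand-in for the learned aggregation in the paper.
        mean = [m + (f - m) / count for m, f in zip(mean, feature)]
        self.cells[key] = (mean, count)

    def read(self, world_point):
        entry = self.cells.get(self.cell_index(world_point))
        return None if entry is None else entry[0]


# Two frames observe the same world location from different camera positions.
memory = GridMemory(cell_size=0.5)
intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

p1 = camera_to_world(backproject(320, 240, 2.0, **intrinsics), 0.0, (1.0, 0.0, 0.0))
memory.update(p1, [1.0, 0.0])

p2 = camera_to_world(backproject(320, 240, 1.0, **intrinsics), 0.0, (1.0, 0.0, 1.0))
memory.update(p2, [0.0, 1.0])

aggregated = memory.read(p1)
```

Because both detections project to the same cell, `aggregated` blends the two per-frame features. In an embodied detector, a feature read back from memory in this way could be fused with the current image features; how that fusion is done in the paper is not specified in the abstract.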