Learning single and multi-scene camera pose regression with transformer encoders

Computer Vision and Image Understanding (2024)

Abstract
Contemporary state-of-the-art localization methods perform feature matching against a structured scene model or learn to regress the scene's 3D coordinates. The resulting matches between 2D query pixels and 3D scene coordinates are used to estimate the camera pose with PnP and RANSAC, which requires the camera intrinsics for both the query and reference images. An alternative approach is to directly regress the camera pose from the query image. Although less accurate, absolute camera pose regression requires no additional information at inference time and is typically lightweight and fast. Recently, Transformers were proposed for learning multi-scene camera pose regression, employing encoders to attend to spatially varying deep features while using decoders to embed multiple scene queries at once. In this work, we show that Transformer encoders can aggregate and extract task-informative latent representations for learning both single- and multi-scene camera pose regression, without Transformer decoders. Our approach reduces the runtime and memory of previous Transformer-based multi-scene solutions, while comparing favorably with contemporary pose regression schemes and achieving state-of-the-art accuracy on multiple indoor and outdoor localization benchmarks. In particular, to the best of our knowledge, ours is the first absolute pose regression approach to attain sub-meter average accuracy across outdoor scenes. We make our code publicly available at: https://github.com/yolish/transposenet.
Keywords
Deep learning, Transformers, Camera pose estimation, Localization
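
To make the encoder-only formulation described in the abstract concrete, below is a minimal PyTorch sketch of an absolute pose regression head: backbone feature maps are flattened into tokens, a learned pose token aggregates them through a Transformer encoder, and small linear heads regress position and orientation. All names and hyperparameters here (PoseEncoderHead, feat_dim, max_tokens, and so on) are hypothetical simplifications for illustration, not the authors' implementation; the actual code is in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseEncoderHead(nn.Module):
    """Encoder-only pose regression head (illustrative sketch).

    Flattens a CNN feature map into tokens, prepends a learned pose
    token, runs a Transformer encoder, and regresses position x and
    orientation quaternion q from the pose token's output embedding.
    """

    def __init__(self, feat_dim=256, n_heads=8, n_layers=6, max_tokens=196):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Learned token whose output summarizes the image, analogous to [CLS].
        self.pose_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        # Learned positional embeddings for the pose token + image tokens.
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + max_tokens, feat_dim))
        self.fc_position = nn.Linear(feat_dim, 3)     # x, y, z
        self.fc_orientation = nn.Linear(feat_dim, 4)  # unit quaternion

    def forward(self, feats):
        # feats: (B, C, H, W) activation map from a CNN backbone,
        # assuming H * W <= max_tokens.
        b = feats.size(0)
        tokens = feats.flatten(2).transpose(1, 2)               # (B, H*W, C)
        tokens = torch.cat([self.pose_token.expand(b, -1, -1), tokens], dim=1)
        tokens = tokens + self.pos_embed[:, :tokens.size(1)]
        latent = self.encoder(tokens)[:, 0]                     # pose token out
        x = self.fc_position(latent)
        q = F.normalize(self.fc_orientation(latent), dim=1)     # unit norm
        return x, q
```

In a multi-scene setting, the same aggregated latent could additionally feed a scene classifier so a single model serves several scenes, which is how this sketch would avoid the decoder stack that prior Transformer-based multi-scene regressors use to embed scene queries.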