C-BEV: Contrastive Bird's Eye View Training for Cross-View Image Retrieval and 3-DoF Pose Estimation
CoRR (2023)
Abstract
To find the geolocation of a street-view image, cross-view geolocalization
(CVGL) methods typically perform image retrieval on a database of georeferenced
aerial images and determine the location from the visually most similar match.
Recent approaches focus mainly on settings where street-view and aerial images
are preselected to align w.r.t. translation or orientation, but struggle in
challenging real-world scenarios where varying camera poses have to be matched
to the same aerial image. We propose a novel trainable retrieval architecture
that uses bird's eye view (BEV) maps rather than vectors as the embedding
representation, and explicitly addresses the many-to-one ambiguity that arises
in real-world scenarios. The BEV-based retrieval is trained using the same
contrastive setting and loss as classical retrieval.
Our method, C-BEV, surpasses the state of the art on the retrieval task on
multiple datasets by a large margin. It is particularly effective in
challenging many-to-one scenarios, e.g. increasing the top-1 recall on VIGOR's
cross-area split with unknown orientation from 31.1% to 65.0%. Although the
model is supervised only through a contrastive objective applied to image
pairs, it additionally learns to infer the 3-DoF camera pose on the matching
aerial image, and even yields a lower mean pose error than recent methods that
are explicitly trained with metric ground truth.
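
To make the contrastive retrieval setting and the top-1 recall metric concrete, here is a minimal sketch. It rests on several assumptions: the abstract does not name the specific loss, so a symmetric InfoNCE loss is used as a common stand-in; the similarity matrix is taken as given, and how C-BEV scores a pair of BEV maps is not modeled here; the function names `info_nce` and `top1_recall` and the `temperature` value are hypothetical.

```python
import math


def info_nce(sim, temperature=0.1):
    """Symmetric InfoNCE loss over a square similarity matrix (assumed loss).

    sim[i][j] is the similarity between street-view query i and aerial
    image j. Matching pairs are assumed to lie on the diagonal.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in logits))
            total += log_z - logits[i]  # negative log-softmax of the match
        return total / n

    # Average the query-to-aerial and aerial-to-query directions.
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(transposed))


def top1_recall(sim):
    """Fraction of queries whose highest-scoring aerial image is the match."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += 1 if best == i else 0
    return hits / len(sim)
```

In this sketch, a similarity matrix whose diagonal entries dominate gives a low loss and a top-1 recall of 1.0, while swapping a match with a non-match raises the loss and lowers the recall.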