Img2Loc: Revisiting Image Geolocalization using Multi-modality Foundation Models and Image-based Retrieval-Augmented Generation
arXiv (2024)
Abstract
Geolocating precise locations from images presents a challenging problem in
computer vision and information retrieval. Traditional methods typically employ
either classification, which divides the Earth's surface into grid cells and
classifies images accordingly, or retrieval, which identifies locations by
matching images with a database of image-location pairs. However,
classification-based approaches are limited by the cell size and cannot yield
precise predictions, while retrieval-based systems usually suffer from poor
search quality and inadequate coverage of the global landscape at varied scales
and aggregation levels. To overcome these drawbacks, we present Img2Loc, a
novel system that redefines image geolocalization as a text generation task.
This is achieved using cutting-edge large multi-modality models (LMMs) like
GPT4V or LLaVA with retrieval-augmented generation. Img2Loc first employs
CLIP-based representations to generate an image-based coordinate query
database. It then uniquely combines query results with the image itself,
forming elaborate prompts
customized for LMMs. When tested on benchmark datasets such as Im2GPS3k and
YFCC4k, Img2Loc not only surpasses the performance of previous state-of-the-art
models but does so without any model training.
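
The retrieve-then-prompt flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional vectors are toy stand-ins for CLIP image embeddings, the names `retrieve_coordinates` and `build_prompt` are hypothetical, and the prompt wording is an assumption. In a real system the query image would also be attached to the LMM request, which is omitted here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_coordinates(query_embedding, database, k=2):
    """Return the coordinates of the k database entries most similar to the query.

    `database` is a list of (embedding, (latitude, longitude)) pairs.
    """
    ranked = sorted(
        database,
        key=lambda entry: cosine_similarity(query_embedding, entry[0]),
        reverse=True,
    )
    return [coords for _, coords in ranked[:k]]

def build_prompt(neighbor_coords):
    """Assemble a text prompt that lists retrieved coordinates as reference points.

    The wording is illustrative only, not the prompt used in the paper.
    """
    lines = ["Estimate the latitude and longitude where the attached image was taken."]
    lines.append("Coordinates of visually similar images, for reference:")
    for lat, lon in neighbor_coords:
        lines.append(f"- ({lat:.4f}, {lon:.4f})")
    lines.append("Answer with a single (latitude, longitude) pair.")
    return "\n".join(lines)

# Toy database: embeddings are made-up vectors, not real CLIP outputs.
toy_database = [
    ([1.0, 0.0, 0.0], (48.8584, 2.2945)),
    ([0.9, 0.1, 0.0], (48.8606, 2.3376)),
    ([0.0, 1.0, 0.0], (40.6892, -74.0445)),
]

query = [1.0, 0.02, 0.0]  # stand-in for the query image's embedding
neighbors = retrieve_coordinates(query, toy_database, k=2)
prompt = build_prompt(neighbors)
print(prompt)
```

Because the location is produced as generated text conditioned on retrieved coordinates, the output is not tied to a fixed grid of cells, and no model weights are updated at any step.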