MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images
CoRR (2023)
Abstract
Semantic segmentation of remote sensing images plays a vital role in a wide
range of Earth Observation (EO) applications, such as land use/land cover
mapping, environmental monitoring, and sustainable development. Driven by rapid
developments in Artificial Intelligence (AI), deep learning (DL) has emerged as
the mainstream tool for semantic segmentation and has achieved many
breakthroughs in the field of remote sensing. However, existing DL-based
methods mainly focus on unimodal visual data while ignoring the rich multimodal
information present in the real world, and thus usually demonstrate weak
reliability and generalization. Inspired by the success of Vision Transformers
and large language models, we propose a novel metadata-collaborative multimodal
segmentation network (MetaSegNet) that applies vision-language representation
learning to semantic segmentation of remote sensing images. Unlike common
model structures that use only unimodal visual data, we extract a key
characteristic (e.g., the climate zone) from freely available remote sensing
image metadata and translate it into knowledge-based text prompts via the
generic ChatGPT. We then construct an image encoder, a text encoder, and a
cross-modal attention fusion subnetwork to extract image and text features
and apply image-text interaction. Benefiting from this design, the proposed
MetaSegNet demonstrates superior generalization and achieves competitive
accuracy against state-of-the-art semantic segmentation methods on the
large-scale OpenEarthMap dataset (68.6% F1 score) as well as the LoveDA
dataset (52.2% F1 score).
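
The pipeline summarized above (image metadata, e.g. climate zone, turned into a knowledge-based text prompt, encoded alongside the image, then joined by cross-modal attention fusion) is only described at a high level in the abstract. The following is a minimal PyTorch sketch of how such an image-text fusion stage could be wired, with image patch features as attention queries and text token features as keys/values. All module names, dimensions, and the prompt template are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Illustrative sketch of an image-text fusion block: multi-head
    cross-attention where image patch features attend to text token
    features. Dimensions and layer choices are assumptions."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, N_patches, dim); txt_feats: (B, N_tokens, dim)
        fused, _ = self.attn(query=img_feats, key=txt_feats, value=txt_feats)
        # Residual connection keeps the original visual features intact.
        return self.norm(img_feats + fused)


def metadata_to_prompt(climate_zone: str) -> str:
    """Hypothetical stand-in for the paper's prompt generation: the
    authors use ChatGPT to produce knowledge-based prompts from
    metadata; here a fixed template serves for illustration."""
    return (f"An aerial image captured in a {climate_zone} climate zone, "
            f"containing land cover classes such as buildings, roads, "
            f"water, and vegetation.")


if __name__ == "__main__":
    B, N_P, N_T, D = 2, 1024, 16, 256
    fusion = CrossModalAttentionFusion(dim=D)
    img = torch.randn(B, N_P, D)   # placeholder output of an image encoder
    txt = torch.randn(B, N_T, D)   # placeholder output of a text encoder
    out = fusion(img, txt)         # (B, N_P, D), ready for a segmentation head
    print(out.shape)
    print(metadata_to_prompt("temperate"))
```

A segmentation head would then decode the fused patch features into a per-pixel class map; the paper's actual encoders, fusion design, and ChatGPT-based prompt generation may differ from this sketch.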