Cross-domain Multi-modal Few-shot Object Detection via Rich Text
arXiv (2024)
Abstract
Cross-modal feature extraction and integration have led to steady performance
improvements in few-shot learning tasks by generating richer features.
However, existing multi-modal object detection (MM-OD) methods degrade when
facing significant domain shift and insufficient samples. We hypothesize
that rich text information could more effectively help the model to build a
knowledge relationship between the vision instance and its language description
and can help mitigate domain shift. Specifically, we study the Cross-Domain
few-shot generalization of MM-OD (CDMM-FSOD) and propose a meta-learning based
multi-modal few-shot object detection method that utilizes rich text semantic
information as an auxiliary modality to achieve domain adaptation in the
context of FSOD. Our proposed network contains (i) a multi-modal feature
aggregation module that aligns the vision and language support feature
embeddings and (ii) a rich text semantic rectify module that utilizes
bidirectional text feature generation to reinforce multi-modal feature
alignment and thus to enhance the model's language understanding capability. We
evaluate our model on common standard cross-domain object detection datasets
and demonstrate that our approach considerably outperforms existing FSOD
methods.
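The abstract does not give implementation details for the multi-modal feature aggregation module. As a loose illustrative sketch only (function names, shapes, and the residual fusion below are assumptions, not the authors' method), one common way to align vision support embeddings with text embeddings is cross-attention, where each vision feature attends over the text tokens and absorbs the attended context:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along an axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_features(vision, text):
    """Cross-attention sketch: each vision token (row of `vision`)
    attends over the rich-text token embeddings (rows of `text`),
    then fuses the attended text context back via a residual sum.

    vision: (Nv, d) array, text: (Nt, d) array -> (Nv, d) array.
    """
    d = vision.shape[-1]
    scores = vision @ text.T / np.sqrt(d)  # (Nv, Nt) similarity logits
    attn = softmax(scores, axis=-1)        # attention over text tokens
    context = attn @ text                  # (Nv, d) attended text context
    return vision + context                # residual fusion (assumed)

rng = np.random.default_rng(0)
v = rng.standard_normal((5, 16))   # 5 hypothetical vision support embeddings
t = rng.standard_normal((8, 16))   # 8 hypothetical rich-text token embeddings
out = aggregate_features(v, t)
print(out.shape)  # (5, 16)
```

The residual sum is just one plausible fusion choice; the paper's actual module may use learned projections, gating, or concatenation instead.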