Symbolizing Visual Features for Pre-training with Unlabeled Images.

Asian Conference on Pattern Recognition (ACPR), 2021

Abstract
Multi-layer Transformers, which have shown good performance in natural language processing (NLP), have recently been applied to multi-modal learning tasks that involve both texts and images. For the NLP part of multi-modal learning, pre-training the Transformer's parameters on large unlabeled text corpora has been shown to improve accuracy. For the image part, however, there are no reports demonstrating the validity of pre-training, even though, intuitively, the prospect of leveraging knowledge obtained from large amounts of unlabeled image data is appealing. This paper aims to construct a single-modal, Transformer-based pre-training model in the image domain for multi-modal learning of texts and images. We have found that, unlike the discrete values underlying word embeddings, current Transformers have trouble handling continuous values such as image features. To overcome this limitation, we propose a Transformer equipped with a list of features, named SymboList, which converts the continuous image features of detected objects into discrete ones by referring to a discrete key list. We demonstrate that our proposed method leads to effective image pre-training and is beneficial to the multi-modal downstream task.
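The abstract's core idea — mapping continuous object features onto entries of a discrete key list — resembles nearest-neighbor vector quantization. The sketch below illustrates that general mechanism only; the function name `symbolize`, the Euclidean distance choice, and the toy shapes are assumptions for illustration, not the paper's actual SymboList implementation.

```python
import numpy as np

def symbolize(features, key_list):
    """Illustrative sketch (not the paper's method): snap each continuous
    feature vector to its nearest key in a discrete key list.

    features: (N, D) continuous vectors, e.g. detected-object features.
    key_list: (K, D) discrete keys the Transformer can treat like tokens.
    Returns the (N,) symbol ids and the (N, D) quantized features.
    """
    # Pairwise Euclidean distances between every feature and every key.
    dists = np.linalg.norm(features[:, None, :] - key_list[None, :, :], axis=-1)
    ids = dists.argmin(axis=1)        # index of the nearest key per feature
    return ids, key_list[ids]         # discrete ids and their key vectors

# Toy example: two 2-D features snapped to a 3-entry key list.
keys = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
feats = np.array([[0.1, -0.1], [0.9, 1.2]])
ids, quantized = symbolize(feats, keys)
```

Once features are symbolized this way, the image side can be handled with the same discrete-token machinery that makes text pre-training effective.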
Keywords
Multi-modal transformer, Image pre-training, Visual Question Answering