BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
arxiv(2022)
摘要
Vision-Language (VL) models with the Two-Tower architecture have dominated
visual-language representation learning in recent years. Current VL models
either use lightweight uni-modal encoders and learn to extract, align and fuse
both modalities simultaneously in a deep cross-modal encoder, or feed the
last-layer uni-modal representations from the deep pre-trained uni-modal
encoders into the top cross-modal encoder. Both approaches potentially restrict
vision-language representation learning and limit model performance. In this
paper, we propose BridgeTower, which introduces multiple bridge layers that
build a connection between the top layers of uni-modal encoders and each layer
of the cross-modal encoder. This enables effective bottom-up cross-modal
alignment and fusion between visual and textual representations of different
semantic levels of pre-trained uni-modal encoders in the cross-modal encoder.
Pre-trained with only 4M images, BridgeTower achieves state-of-the-art
performance on various downstream vision-language tasks. In particular, on the
VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73
the previous state-of-the-art model METER by 1.09
data and almost negligible additional parameters and computational costs.
Notably, when further scaling the model, BridgeTower achieves an accuracy of
81.15
datasets. Code and checkpoints are available at
https://github.com/microsoft/BridgeTower.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要