Centered Masking for Language-Image Pre-Training
arxiv(2024)
摘要
We introduce Gaussian masking for Language-Image Pre-Training (GLIP) a novel,
straightforward, and effective technique for masking image patches during
pre-training of a vision-language model. GLIP builds on Fast Language-Image
Pre-Training (FLIP), which randomly masks image patches while training a CLIP
model. GLIP replaces random masking with centered masking, that uses a Gaussian
distribution and is inspired by the importance of image patches at the center
of the image. GLIP retains the same computational savings as FLIP, while
improving performance across a range of downstream datasets and tasks, as
demonstrated by our experimental results. We show the benefits of GLIP to be
easy to obtain, requiring no delicate tuning of the Gaussian, and also
applicable to data sets containing images without an obvious center focus.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要