Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition
arXiv (2024)
Abstract
In text recognition, self-supervised pre-training emerges as a good solution
to reduce dependence on expensive annotated real data. Previous studies
primarily focus on local visual representation by leveraging mask image
modeling or sequence contrastive learning. However, they omit modeling the
linguistic information in text images, which is crucial for recognizing text.
To simultaneously capture local character features and linguistic information
in visual space, we propose Symmetric Superimposition Modeling (SSM). The
objective of SSM is to reconstruct the direction-specific pixel and feature
signals from the symmetrically superimposed input. Specifically, we add the
original image to its inverted views to create the symmetrically superimposed
inputs. At the pixel level, we reconstruct the original and inverted images to
capture character shapes and texture-level linguistic context. At the feature
level, we reconstruct the feature of the same original image and inverted image
with different augmentations to model the semantic-level linguistic context and
the local character discrimination. In our design, we disrupt the character
shape and linguistic rules. Consequently, the dual-level reconstruction
facilitates understanding character shapes and linguistic information from the
perspective of visual texture and feature semantics. Experiments on various
text recognition benchmarks demonstrate the effectiveness and generality of
SSM, with a 4.1% gain in word accuracy on Union14M benchmarks.
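The core input construction described above — superimposing an image with an inverted view of itself, then reconstructing both views — can be sketched as follows. This is an illustrative sketch only: the flip axis, the equal mixing weights, and the grayscale toy input are assumptions, not the paper's exact recipe.

```python
import numpy as np

def symmetric_superimpose(img):
    """Create a symmetrically superimposed input from an image and its
    inverted view, in the spirit of SSM's pre-training objective.

    A horizontal flip stands in for the "inverted view" and equal 0.5
    weights for the mixing scheme (both are assumptions). The model would
    be trained to reconstruct both returned targets from `mixed`.
    """
    inverted = img[:, ::-1]            # inverted view (flip axis assumed)
    mixed = 0.5 * img + 0.5 * inverted # symmetric superimposition
    targets = (img, inverted)          # dual pixel-level reconstruction targets
    return mixed, targets

# Toy grayscale "text image" standing in for a real crop.
img = np.arange(12, dtype=np.float64).reshape(3, 4)
mixed, (t_orig, t_inv) = symmetric_superimpose(img)
```

Note that with equal weights the superimposed input is itself symmetric under the flip, which is what forces the model to recover direction-specific signals rather than read them off the input directly.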