Improving OCR-based Image Captioning by Incorporating Geometrical Relationship

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021 (2021)

Abstract
OCR-based image captioning aims to automatically describe images based on all the visual entities (both visual objects and scene text) in them. Compared with conventional image captioning, OCR-based image captioning requires reasoning over scene text, since the generated descriptions often contain multiple OCR tokens. Existing methods attempt to achieve this goal by encoding the OCR tokens with rich visual and semantic representations. However, strong correlations between OCR tokens may not be established with such limited representations. In this paper, we propose to enhance the connections between OCR tokens by exploiting their geometrical relationship. We comprehensively consider the height, width, distance, IoU and orientation relations between the OCR tokens when constructing the geometrical relationship. To integrate the learned relation as well as the visual and semantic representations into a unified framework, we present a Long Short-Term Memory plus Relation-aware pointer network (LSTM-R) architecture. Under the guidance of the geometrical relationship between OCR tokens, LSTM-R capitalizes on a newly devised relation-aware pointer network to select OCR tokens from the scene text for OCR-based image captioning. Extensive experiments demonstrate the effectiveness of LSTM-R. Most notably, LSTM-R achieves state-of-the-art performance on TextCaps, increasing the CIDEr-D score from 98.0% to 109.3%.
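To make the five relations named in the abstract concrete, the following is a minimal sketch of how pairwise geometric features between OCR bounding boxes might be computed. The abstract does not give the exact formulas, so everything here is an assumption for illustration: the `(x1, y1, x2, y2)` box layout, the use of plain ratios for height and width, center-to-center Euclidean distance, and `atan2` of the center offset as the orientation. The function names `iou`, `pair_features`, and `relation_matrix` are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical box layout: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pair_features(a: Box, b: Box) -> Dict[str, float]:
    """Geometric relation of box b relative to box a.

    Assumes non-degenerate boxes (positive width and height).
    The specific feature definitions are illustrative assumptions.
    """
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    # Offset between box centers.
    dx = (b[0] + b[2]) / 2 - (a[0] + a[2]) / 2
    dy = (b[1] + b[3]) / 2 - (a[1] + a[3]) / 2
    return {
        "height_ratio": hb / ha,
        "width_ratio": wb / wa,
        "distance": math.hypot(dx, dy),
        "iou": iou(a, b),
        "orientation": math.atan2(dy, dx),  # radians, in (-pi, pi]
    }


def relation_matrix(boxes: List[Box]) -> List[List[Dict[str, float]]]:
    """Pairwise relations for every ordered pair of OCR boxes."""
    return [[pair_features(a, b) for b in boxes] for a in boxes]
```

In a model like the one described, such pairwise features would presumably be embedded and fed to the relation-aware pointer network alongside the visual and semantic token representations; how the paper actually does this is not stated in the abstract.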
Keywords
OCR-based image captioning, scene text, conventional image captioning, multiple OCR tokens, geometrical relationship