Large scale document image retrieval by automatic word annotation
International Journal on Document Analysis and Recognition (IJDAR) (2013)
Abstract
In this paper, we present a practical and scalable retrieval framework for large-scale document image collections, targeting an Indian language script that lacks a robust OCR. OCR-based methods face difficulties in character segmentation and recognition, especially for complex Indian language scripts. We observe that character recognition is only an intermediate step toward labeling words. Hence, we re-pose the problem as one of directly performing word annotation. This approach offers better recognition performance and has simpler segmentation requirements. However, the number of classes in word annotation is much larger than that in character recognition, which makes a naive classification scheme expensive to train and test. To address this issue, we present a novel framework that replaces naive classification with a carefully designed mixture of indexing and classification schemes. This enables us to build a search system over a large collection of 1,000 Telugu books, consisting of 120K document images or 36M individual words. To our knowledge, this is the largest searchable document image collection for a script without an OCR. Our retrieval system achieves a mean average precision of 0.8.
Keywords
Image retrieval, Automatic annotation, Document images, OCR-free annotation
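
The abstract reports retrieval quality as mean average precision (MAP). For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition, not the authors' evaluation code. The function names and the boolean relevance-list input are assumptions made for illustration, and average precision is normalized here by the number of relevant items that appear in the ranked list.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average precision for one query.

    ranked_relevance[i] is True if the result at rank i+1 is relevant.
    Assumption: every relevant item appears somewhere in the list, so the
    sum is normalized by the number of relevant items retrieved.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision over all queries."""
    if not runs:
        return 0.0
    return sum(average_precision(run) for run in runs) / len(runs)


if __name__ == "__main__":
    # Two hypothetical word queries with hand-made relevance judgments.
    queries = [
        [True, False, True],  # relevant results at ranks 1 and 3
        [False, True],        # relevant result at rank 2
    ]
    print(mean_average_precision(queries))
```

A MAP of 0.8 therefore means that, averaged over the query set, relevant word images tend to sit near the top of each ranked result list.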