Character N-Gram Spotting on Handwritten Documents Using Weakly-Supervised Segmentation

Document Analysis and Recognition (2013)

Abstract
In this paper, we present a solution towards building a retrieval system over handwritten document images that i) is recognition-free, ii) allows text-querying, iii) can retrieve at sub-word level, and iv) can search for out-of-vocabulary words. Unlike previous approaches that operate at either the character or the word level, we use character n-gram images (CNG-img) as the retrieval primitive. CNG-img are sequences of character segments that are represented and matched in the image space. Word images are then treated as a bag of CNG-img that can be indexed and matched in the feature space. This allows for recognition-free search (query-by-example), which can retrieve morphologically similar words that have matching sub-words. Further, to enable query-by-keyword, we build an automated scheme to generate labeled exemplars for characters and character n-grams from unconstrained handwritten documents. We pose this problem as one of weakly-supervised learning, where character/n-gram labeling is obtained automatically from the word labels. The resulting retrieval system can answer queries from an unlimited vocabulary. The approach is demonstrated on the George Washington collection; results show a major improvement in retrieval performance compared to word-recognition and word-spotting methods.
Keywords
handwritten character recognition,image matching,image representation,image retrieval,image segmentation,CNG-img retrieval primitive,George Washington collection,character N-gram spotting,handwritten document image,handwritten documents,image retrieval system,image-space matching,image-space representation,out-of-vocabulary words,query-by-example,query-by-keyword,recognition-free system,sub-word level retrieval,text-querying,weakly-supervised segmentation,word-recognition method,word-spotting method
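
The bag-of-n-grams indexing idea described in the abstract can be illustrated with a small sketch. This is not the authors' method: it uses plain text strings as a stand-in for segmented character images, and shared n-gram counts as a stand-in for image-space matching. The function names (`char_ngrams`, `build_index`, `search`), the n-gram sizes, and the sample words are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def char_ngrams(word, n_values=(2, 3)):
    """Return the bag (multiset) of character n-grams of a word.

    Text characters stand in for character image segments here.
    """
    bag = Counter()
    for n in n_values:
        for i in range(len(word) - n + 1):
            bag[word[i:i + n]] += 1
    return bag


def build_index(words, n_values=(2, 3)):
    """Build an inverted index: n-gram -> ids of the words containing it."""
    index = defaultdict(set)
    for word_id, word in enumerate(words):
        for gram in char_ngrams(word, n_values):
            index[gram].add(word_id)
    return index


def search(query, words, index, n_values=(2, 3)):
    """Rank indexed words by the number of distinct n-grams shared with the query."""
    scores = Counter()
    for gram in char_ngrams(query, n_values):
        for word_id in index.get(gram, ()):
            scores[word_id] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], words[kv[0]]))
    return [(words[word_id], score) for word_id, score in ranked]


# Hypothetical sample vocabulary; "order" itself is not indexed.
words = ["orders", "ordered", "letter", "captain"]
index = build_index(words)
results = search("order", words, index)
print(results)
```

Because the query is matched through its sub-word units rather than as a whole word, a query that never appears in the indexed collection can still retrieve morphologically related words, which is the behaviour the abstract attributes to sub-word retrieval.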