Retrieval of handwritten historical document images (2005)

Abstract
Historical library collections across the world hold huge numbers of handwritten documents. By digitizing these manuscripts, their content can be preserved and made available to a large community via the Internet or other electronic media. Such corpora can nowadays be shared relatively easily, but they are often large, unstructured, and only available in image formats, which makes them difficult to access. In particular, finding specific locations of interest in a handwritten image collection is generally very tedious without some sort of index or other access tool. The current solution for this problem is to manually annotate a historical collection, which is very costly in terms of time and money. In this work we explore several automatic techniques that allow the retrieval of handwritten document images with text queries. These are (i) word spotting, an approach that clusters word images to identify and annotate content-bearing words in a collection, (ii) handwriting recognition followed by text retrieval, and (iii) cross-modal retrieval models, which capture the joint occurrence of annotations and word image features in a probabilistic model. We compare the performance of these approaches empirically on several test collections. The main contributions of this work are a detailed examination of retrieval approaches for historical manuscripts, and the development of the first image retrieval system for historical manuscripts that allows text queries. This system extends the field of digital libraries beyond machine-printed text into historical handwritten documents. Building such a system involves challenges on numerous levels: the noisy historical manuscript domain requires adequate image filtering, normalization and representation techniques, as well as a robust and scalable retrieval framework. We describe the construction of a prototype system, which demonstrates the feasibility of the proposed techniques for a large collection of handwritten historical documents.
Keywords
cross-modal retrieval model, historical library collection, retrieval approach, historical manuscript, handwritten historical document image, historical handwritten document, image retrieval system, text query, noisy historical manuscript domain, historical collection, handwritten historical document