A Statistical Approach to Retrieving Historical Manuscript Images without Recognition

msra(2003)

Abstract
Handwritten historical document collections in libraries and other areas are often of interest to researchers, students or the general public. Convenient access to such corpora generally requires an index, which allows one to locate individual text units (pages, sentences, lines) that are relevant to a given query (usually provided as text). Several solutions are possible: manual annotation (very expensive), handwriting recognition (poor results) and word spotting, an image matching approach (computationally expensive). In this work, we present a novel retrieval approach for historical document collections, which does not require recognition. We assume that word images can be described using a vocabulary of discretized word features. From a training set of labeled word images, we extract discrete feature vectors, and estimate the joint probability distribution of features and word labels. For a given feature vector (i.e. a word image), we can then calculate conditional probabilities for all labels in the training vocabulary. Experiments show that this relevance-based language model works very well, with a mean average precision of 89% for 4-word queries on a subset of George Washington's manuscripts.
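The estimation step described in the abstract can be illustrated with a small sketch. The abstract does not give the estimator, so everything below is an assumption made for illustration: the joint distribution is approximated as a mixture over training word images, each image's feature and label distributions are linearly smoothed with collection-wide frequencies, and the names `train`, `label_posterior`, the smoothing weight `lam`, and the toy feature tokens are all hypothetical.

```python
from collections import Counter


def train(samples):
    """Collect background statistics from (label, feature_tuple) pairs.

    Assumption: each word image is already reduced to a tuple of
    discrete feature tokens; how that is done is outside this sketch.
    """
    feat_bg = Counter(f for _, feats in samples for f in feats)
    label_bg = Counter(label for label, _ in samples)
    return {
        "samples": list(samples),
        "feat_bg": feat_bg,
        "label_bg": label_bg,
        "n_feats": sum(feat_bg.values()),
        "n": len(samples),
    }


def label_posterior(model, query_feats, lam=0.9):
    """Return P(label | query_feats) for every label in the training vocabulary.

    Assumed joint: P(w, f_1..f_k) = sum_J P(J) * P(w|J) * prod_i P(f_i|J),
    with uniform P(J) and linear smoothing weight `lam` (hypothetical value).
    """
    scores = Counter()
    for label, feats in model["samples"]:
        counts = Counter(feats)
        p_feats = 1.0
        for f in query_feats:
            p_img = counts[f] / len(feats)                 # frequency in this image
            p_bg = model["feat_bg"][f] / model["n_feats"]  # collection frequency
            p_feats *= lam * p_img + (1 - lam) * p_bg
        for w, c in model["label_bg"].items():
            p_w = lam * (1.0 if w == label else 0.0) + (1 - lam) * c / model["n"]
            scores[w] += p_feats * p_w / model["n"]        # uniform P(J) = 1/n
    total = sum(scores.values())
    if total == 0:
        # Query features never seen in training: fall back to uniform.
        return {w: 1.0 / len(model["label_bg"]) for w in model["label_bg"]}
    # Conditioning: divide the joint by the marginal of the features.
    return {w: s / total for w, s in scores.items()}


# Toy training set with made-up feature tokens.
toy = [
    ("the", ("a1", "b2", "c1")),
    ("the", ("a1", "b2", "c2")),
    ("and", ("a3", "b1", "c1")),
]
model = train(toy)
posterior = label_posterior(model, ("a1", "b2", "c1"))
```

For retrieval, one would compute such a posterior for every word image in the collection and rank text units by the probability they assign to the query words; that ranking step is likewise only implied by the abstract.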
Keywords
information retrieval,language model,indexation,matching,shape,history,mean average precision,automation,probability distribution,feature extraction,handwriting recognition,precision,feature vector