The Carabela Project and Manuscript Collection: Large-Scale Probabilistic Indexing and Content-based Classification

2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR) (2020)

Abstract
The main aim of the Carabela project was to develop and apply techniques that allow textual searching on massive Spanish collections of 15th-19th century manuscripts. The project focused on a relatively small subset of 125 000 images from collections of interest to underwater archaeology. For this type of manuscript, state-of-the-art automatic transcription techniques generally fail to achieve usable transcription accuracy. Therefore, rather than insisting on actual transcription, methodologies for probabilistic indexing of handwritten text images have been adopted. This has allowed us to effectively cope with the intrinsically high degree of uncertainty of the text contained in most historical manuscripts, leading to highly effective systems for textual search and retrieval. Carabela has gone one step further by developing new techniques to classify probabilistically indexed, but otherwise untranscribed, text images according to their textual content. These techniques have been successfully used to automatically classify Carabela bundles (each containing hundreds or thousands of pages) according to their “level of risk” of public exposure, in order to control their access and avoid as much as possible the plundering of Spanish underwater heritage.
Keywords
Handwritten Text Images, Large-Scale Probabilistic Indexing, Keyword Spotting, Content-based Image Classification
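
To make the idea of searching a probabilistic index more concrete, the following is a minimal sketch and not the Carabela implementation. It assumes an index can be reduced to (page, word, relevance probability) triples; the page identifiers, words, probabilities, and function names are all hypothetical. It shows a thresholded keyword search, plus an expected word count that could plausibly serve as a content feature for untranscribed pages.

```python
from collections import defaultdict

# Hypothetical toy data: (page_id, word, relevance_probability).
# A real probabilistic index would also store word positions; omitted here.
ENTRIES = [
    ("page_001", "navio", 0.92),
    ("page_001", "nabio", 0.31),  # competing spelling hypothesis
    ("page_002", "navio", 0.12),
    ("page_002", "oro", 0.85),
    ("page_003", "oro", 0.40),
]


def build_index(entries):
    """Map each word to {page_id: probability that the word occurs on the page}."""
    index = defaultdict(dict)
    for page_id, word, prob in entries:
        # Assumption: if a word is hypothesised several times on a page,
        # keep only its highest probability.
        index[word][page_id] = max(prob, index[word].get(page_id, 0.0))
    return index


def search(index, word, threshold=0.5):
    """Return (page_id, probability) pairs at or above the threshold, best first."""
    hits = [(page, prob) for page, prob in index.get(word, {}).items()
            if prob >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def expected_count(index, word):
    """Sum of relevance probabilities: an estimate of how many pages contain the word."""
    return sum(index.get(word, {}).values())


index = build_index(ENTRIES)
print(search(index, "oro"))                 # high-confidence hits only
print(search(index, "oro", threshold=0.3))  # lower threshold: higher recall
print(expected_count(index, "navio"))
```

Lowering the threshold trades precision for recall, which is how a search system of this kind can cope with uncertain text without committing to a single transcription.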