The Carabela Project and Manuscript Collection: Large-Scale Probabilistic Indexing and Content-based Classification

2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR) (2020)

Abstract
The main aim of the Carabela project was to develop and apply techniques that allow textual searching on massive Spanish collections of 15th-19th century manuscripts. The project focused on a relatively small subset of 125 000 images from collections of interest to underwater archaeology. For manuscripts of this type, state-of-the-art automatic transcription techniques generally fail to achieve usable transcription accuracy. Therefore, rather than insisting on actual transcription, methodologies for probabilistic indexing of handwritten text images have been adopted. This has allowed us to cope effectively with the intrinsically high degree of uncertainty of the text contained in most historical manuscripts, leading to highly effective systems for textual search and retrieval. Carabela has gone one step further by developing new techniques to classify probabilistically indexed, but otherwise untranscribed, text images according to their textual content. These techniques have been successfully used to automatically classify Carabela bundles (each containing hundreds or thousands of pages) according to their “level of risk” of public exposure, in order to control access to them and to prevent, as far as possible, the plundering of Spanish underwater heritage.
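
The idea of searching a probabilistic index instead of a transcript can be illustrated with a minimal sketch. It is not the Carabela implementation: the `Spot` record, the `search` function, the sample words, and the probabilities are all hypothetical, and the sketch only assumes that an index stores word hypotheses per page image together with a relevance probability, so that a query returns the entries above a user-chosen confidence threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spot:
    """One hypothetical index entry: a word hypothesis on a page image."""
    word: str       # hypothesized word (lowercase)
    image_id: str   # identifier of the page image
    prob: float     # estimated probability that the word occurs there


def search(index, query, threshold=0.5):
    """Return entries matching the query with prob >= threshold, best first."""
    q = query.lower()
    hits = [s for s in index if s.word == q and s.prob >= threshold]
    return sorted(hits, key=lambda s: s.prob, reverse=True)


# Toy index (made-up values): competing hypotheses for the same image
# region can coexist, which is what lets the index keep uncertainty
# instead of committing to a single transcription.
index = [
    Spot("navio", "page_001", 0.92),
    Spot("navio", "page_007", 0.31),
    Spot("nauio", "page_007", 0.55),
    Spot("oro", "page_001", 0.78),
]

results = search(index, "navio", threshold=0.3)
for spot in results:
    print(spot.image_id, spot.prob)
```

Lowering the threshold trades precision for recall: with `threshold=0.5` only the `page_001` entry is returned, while `threshold=0.3` also returns the less certain `page_007` entry.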
Key words
Handwritten Text Images, Large-Scale Probabilistic Indexing, Keyword Spotting, Content-based Image Classification