Search for Hyphenated Words in Probabilistic Indices: A Machine Learning Approach.

ICDAR (1)(2023)

Abstract
Hyphenated words are one of the most common challenges in historical handwritten documents. For information retrieval, users issue an entire-word query and expect to retrieve all occurrences of this word, including the hyphenated ones. Thus, methods for predicting hyphenated word fragments and joining them must be developed. In this paper, we build upon and extend the work of Vidal and Toselli (2021) based on probabilistic indexing. We propose a new probabilistic framework to merge prefix/suffix word fragments into “combined spots”, searchable through entire-word queries, and assess different techniques to estimate the corresponding relevance (or “spotting”) probabilities. We also consider the use of a hyphenation tool to join these text fragments at query time. We discuss the retrieval results and storage costs obtained using either probabilistic indices or plain automatic 1-best transcripts. The results show that it is possible to train a machine-learning system to join prefix/suffix word fragments automatically, with good information retrieval performance and reasonable storage usage.
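
The idea of merging prefix/suffix fragments into combined spots can be illustrated with a minimal Python sketch. It is not the paper's method: the `Spot` record, the `combine_fragments` helper, and the product-of-probabilities estimate are all hypothetical choices made for illustration, and the abstract does not say which estimation techniques the authors actually assess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spot:
    text: str    # hypothesized word or word fragment
    line: int    # index of the text line where the spot starts
    prob: float  # relevance ("spotting") probability


def combine_fragments(prefixes, suffixes, min_prob=0.0):
    """Join each end-of-line prefix with each suffix on the following line.

    Assumption: the combined probability is the product of the two fragment
    probabilities (a naive independence estimate, not the paper's model).
    """
    combined = []
    for pre in prefixes:
        for suf in suffixes:
            if suf.line != pre.line + 1:
                continue  # a suffix must open the line right after the prefix
            prob = pre.prob * suf.prob
            if prob > min_prob:
                word = pre.text.rstrip("-") + suf.text
                combined.append(Spot(word, pre.line, prob))
    return combined


# Hypothetical fragment spots: two competing prefixes on line 3,
# one suffix on line 4, and an unrelated suffix on line 7.
prefixes = [Spot("hyphen-", 3, 0.8), Spot("hy-", 3, 0.1)]
suffixes = [Spot("ated", 4, 0.5), Spot("ated", 7, 0.9)]
spots = combine_fragments(prefixes, suffixes)

# An entire-word query then matches the combined spot directly.
hits = [s for s in spots if s.text == "hyphenated"]
```

Raising `min_prob` discards low-probability combinations, which is one simple way to trade retrieval recall against index size.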
Keywords
hyphenated words,probabilistic indices,machine learning