Search Space Reduction for Holistic Ligature Recognition in Urdu Nastalique Script

Document Analysis and Recognition(2013)

引用 8|浏览2
暂无评分
摘要
This paper addresses the problem of holistic recognition of printed ligatures in Nastalique writing style of the Urdu language. The main difficulty of the recognition process lies in the large number of classes/ligatures (17,000 different possible ligatures in our Urdu text data). This large number of classes not only limits the efficiency (run-time) of the recognition algorithms, but also makes it difficult to use state-of-the-art classifiers - like Random Forests - that can only handle up to a few hundred classes. Nearest neighbor classifiers scale up well to tackle such large-scale classification problems, however their poor run-time efficiency poses a major obstacle. In this paper, we investigate two strategies for improving the efficiency (reducing the search space) of nearest neighbor based classification of Urdu ligatures. The first approach uses spectral hashing to resort to approximate nearest neighbor classification. The second approach is based on the idea of hierarchical classification to partition the search space based on the number of characters in a ligature. Experiments using spectral hashing show that the search space of nearest neighbor comparison can be reduced by about 50% without a loss in recognition accuracy. Further experiments demonstrate that the Random Forest classifier can be reliably used as the first stage classifier to distinguish one-character ligatures from multiple-character ligatures in a hierarchical classification scheme. We hope that the ideas presented in this paper would build the foundations for practical large-scale ligature classification systems not only for Nastalique, but also for other Urdu and Arabic scripts.
更多
查看译文
关键词
nearest neighbor classifier,large-scale classification problem,holistic ligature recognition,urdu nastalique script,approximate nearest neighbor classification,urdu language,hierarchical classification scheme,practical large-scale ligature classification,search space reduction,search space,large number,hierarchical classification,urdu ligature,decision trees,text analysis,natural language processing,information retrieval
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要