Development of Part-of-Speech tagger for a low-resource endangered language

Toshal Gore, Vaibhav Khatavkar

2022 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), 2022

Abstract
India is a multilingual country in which a large number of languages are spoken, the major ones being Hindi, Bengali and Marathi. Research on Indian languages in the Natural Language Processing (NLP) domain has been limited. This is because Indian languages are written in Brahmic scripts rather than the Latin alphabet, which makes them very difficult for NLP systems to process. Most Indian languages have many dialects and differ from English in many linguistic characteristics. In addition, many Indian languages are on the verge of extinction, and very little NLP work has been done that could help preserve them. The datasets available for such low-resource languages are very small. One such language is Katkari, an endangered Indian tribal language and a dialect of Marathi. The purpose of this work is to develop a Part-of-Speech (POS) tagger for Katkari. POS tagging is a technique in which each word in a text is assigned a POS label based on its context. POS taggers have been developed for several Indian languages, but not yet for Katkari. Hence, this paper presents a POS tagger for Katkari built with a Hidden Markov Model (HMM) and the Viterbi algorithm. The Katkari POS tagger was compared with POS taggers for other Indian languages and achieved an accuracy of 86.84%.
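
As an illustration of the general technique named in the abstract, the following is a minimal sketch of a bigram HMM tagger decoded with the Viterbi algorithm. It is not the authors' implementation: the function names (`train_hmm`, `log_prob`, `viterbi`), the add-one smoothing, the `<s>` start symbol, and the tiny English `toy_corpus` are all assumptions made for this example, and no Katkari data is used.

```python
from collections import Counter, defaultdict
import math

START = "<s>"  # hypothetical sentence-start symbol


def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from (word, tag) sentences."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, sorted(emissions), vocab


def log_prob(counter, key, num_outcomes):
    """Add-one smoothed log probability (smoothing is an assumption here)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + num_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence for `words` under the model."""
    transitions, emissions, tags, vocab = model
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words

    # best[i][tag] = (log score of best path ending in tag at i, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions[START], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], tag, n_tags) + emit, p)
                for p in tags
            )

    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    path.reverse()
    return path


# Toy English corpus, used only because no Katkari data is given here.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_hmm(toy_corpus)
print(viterbi(["the", "cat", "barks"], model))
```

Working in log space avoids numeric underflow on longer sentences, and the smoothing lets the decoder assign a tag to words that never appeared in training, which matters when the training set is as small as the abstract describes.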
Keywords
POS tagger, Hidden Markov Model, Viterbi algorithm, Machine Learning, NLP, Indian languages