Developing the Bangladeshi National Corpus- a Balanced and Representative Bangla Corpus

2019 International Conference on Sustainable Technologies for Industry 4.0 (STI) (2019)

Abstract
The need for a balanced, representative, national-scale corpus has grown sharply for Bangla, a language already tagged as 'low-resource'. Many sporadic empirical works have been carried out so far in NLP and computational linguistics, yet these are not enough. Moreover, none of these works can yield its best results without the help of a standard corpus. To address these issues, this research set out to compile the Bangladeshi National Corpus (BDNC). This paper presents the development process of the BDNC (first phase: the Bangla monolingual corpus). The whole task is divided into three major phases. The goal of the first phase is to build a representative monolingual corpus that will include at least 100 million Bangla words. In the second phase, there will be a sub-corpus consisting of a parallel corpus of 1 million words in Bangla and English. In the third and final phase, the parallel corpus will incorporate 15 foreign languages (including English), comprising a weighted corpus size of at least 15 million words.
Keywords
Bangla, Corpus, balanced, representative, monolingual corpus, multi-lingual corpus, translation corpus, parallel corpus