BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla

The Annual Conference of the North American Chapter of the Association for Computational Linguistics (2021)

Abstract
In this short paper, we introduce ‘BanglaBERT’, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed ‘Bangla2B+’) by crawling 110 popular Bangla sites. We introduce two new downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Evaluation (BangLUE) benchmark. BanglaBERT achieves state-of-the-art results outperforming multilingual and monolingual models. We are making the BanglaBERT model, the new datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP.
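As a quick illustration of how the released checkpoint could be fine-tuned for one of the benchmark's text-classification tasks, below is a minimal sketch using the Hugging Face transformers library. It assumes the model is distributed on the Hugging Face Hub under an identifier matching the GitHub organization (csebuetnlp/banglabert); that identifier, the example sentence, and the binary label head are illustrative assumptions, not details taken from the paper.

    # Minimal sketch: loading BanglaBERT for a sentence-classification task.
    # Assumption: the checkpoint is hosted on the Hugging Face Hub as
    # "csebuetnlp/banglabert"; the model ID, example text, and label count
    # are illustrative, not specified by the paper itself.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_ID = "csebuetnlp/banglabert"  # assumed Hub identifier

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=2  # e.g. a binary text-classification head
    )

    text = "এটি একটি উদাহরণ বাক্য।"  # "This is an example sentence."
    inputs = tokenizer(text, return_tensors="pt", truncation=True)

    with torch.no_grad():
        logits = model(**inputs).logits

    predicted_class = logits.argmax(dim=-1).item()
    print(predicted_class)

The same loading pattern extends to the paper's other task types by swapping the head class (for example, a token-classification head for sequence labeling or a question-answering head for span prediction).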
Keywords
language understanding evaluation, language model pretraining, BanglaBERT, low-resource