Tibetan-BERT-wwm: A Tibetan Pretrained Model With Whole Word Masking for Text Classification

Yatao Liang, Hui Lv, Yan Li, La Duo, Chuanyi Liu, Qingguo Zhou

IEEE Transactions on Computational Social Systems (2024)

Abstract
Social networks generate massive amounts of user-produced text data, a major driver of the information explosion. Because these data are unstructured and ambiguously expressed, it is difficult to extract the contextual information that would yield accurate insights into user-generated content, user preferences, and topic dynamics within social networks. By pretraining on a large-scale unlabeled corpus and fine-tuning with a limited amount of supervised data, a pretrained language model can capture rich contextual information and achieve excellent performance on numerous downstream natural language processing (NLP) tasks. For a low-resource language such as Tibetan, the dynamic, context-dependent distributed representations obtained from pretrained language models can effectively alleviate the problem of insufficient labeled data. To better capture contextual and word-level semantic information in Tibetan social media, we collected a large Tibetan corpus and trained a Tibetan pretrained language model, named Tibetan-BERT-wwm, using a whole word masking strategy. We then applied the model to textual data from social networks to assess its efficacy in capturing user sentiment and news topics in Tibetan social media. Accuracy, precision, recall, and F1 score were used to evaluate performance. Tibetan-BERT-wwm achieves macro-F1 scores of 75.55% and 64.17% on the document and title classification tasks of the public TNCC dataset, respectively, and 70.98% on our self-built sentiment analysis dataset. Compared with other pretrained language models, Tibetan-BERT-wwm captures Tibetan semantic information well and improves Tibetan text classification performance.
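The whole word masking idea mentioned in the abstract can be illustrated with a short sketch: instead of masking individual subword pieces independently, all pieces belonging to a selected word are masked together, so the model must predict the whole word from context. The helper below is a hypothetical illustration, not the authors' code; it assumes standard BERT WordPiece tokens where "##" marks a continuation piece, whereas the paper's actual preprocessing for Tibetan presumably applies a Tibetan word segmenter first.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Minimal sketch of whole word masking: every WordPiece subtoken
    of a selected word is masked as a unit."""
    # Group token indices into whole-word spans ('##' marks a continuation piece).
    spans = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and spans:
            spans[-1].append(i)
        else:
            spans.append([i])

    masked = list(tokens)
    labels = [None] * len(tokens)  # None = not a prediction target
    for span in spans:
        if random.random() < mask_prob:
            for i in span:
                labels[i] = tokens[i]   # the model is trained to recover these
                masked[i] = mask_token  # every piece of the word is masked
    return masked, labels

# Example: both pieces of one word are masked together.
print(whole_word_mask(["tib", "##etan", "text"], mask_prob=1.0))
# (['[MASK]', '[MASK]', '[MASK]'], ['tib', '##etan', 'text'])
```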
Keywords
Natural language processing (NLP), pretrained model, text classification, Tibetan