LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish.

International Conference on Language Resources and Evaluation (LREC), 2022

Abstract
Pre-trained language models such as BERT have become ubiquitous in NLP, where they achieve state-of-the-art performance on most tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset with text data from a closely related language that we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well as the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.
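The abstract only characterizes the augmentation as a "simple and straightforward" partial translation from a closely related language. As a rough illustration of what word-level partial translation can look like, here is a minimal Python sketch; the assumed source language (German), the DE_LB_DICT lexicon, and the partially_translate helper are hypothetical and not taken from the paper.

```python
# Minimal sketch of dictionary-based partial translation for data
# augmentation. The source language (German here), the lexicon, and
# this word-level substitution scheme are illustrative assumptions,
# not the paper's actual procedure.

# Hypothetical German -> Luxembourgish word mappings.
DE_LB_DICT = {
    "ich": "ech",
    "nicht": "net",
    "und": "an",
}

def partially_translate(sentence: str, lexicon: dict[str, str]) -> str:
    """Replace every word found in the lexicon, leaving the rest as-is.

    The output is mixed-language text that sits closer to the target
    language than the source, which is the rough idea behind enlarging
    a low-resource pre-training corpus with related-language data.
    """
    out = []
    for tok in sentence.split():
        key = tok.lower().strip(".,!?")
        out.append(lexicon.get(key, tok))
    return " ".join(out)

if __name__ == "__main__":
    print(partially_translate("ich gehe nicht nach Hause und schlafe", DE_LB_DICT))
    # -> "ech gehe net nach Hause an schlafe"
```

Words absent from the lexicon pass through unchanged, so the degree of translation is controlled entirely by lexicon coverage; a real pipeline would presumably apply this over a large German corpus before mixing it with the native Luxembourgish pre-training data.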
Keywords
Language Models, Less-Resourced Languages, NLP Datasets