Arabic Dialect Identification: Experimenting Pre-trained Models and Tools on Country-level Datasets.

Khloud Khaled, Tasneem Wael, Salma Khaled, Walaa Medhat

ACS/IEEE International Conference on Computer Systems and Applications (2023)

Abstract
Arabic Dialect Identification (ADI) is the task of automatically detecting the regional dialect of Arabic from a given text or speech sample. It has gained significant attention due to the increasing demand for language processing in applications such as social media analysis, content localization, and dialectal text analysis. In this paper, an ADI system is built using a country-level Arabic dataset. The approach is to run a set of experiments using CAMeL Tools, a machine learning-based toolkit built for Arabic Natural Language Processing (NLP). Transformer-based models are also tested, namely the BERT-based pre-trained models AraBERT, CAMeLBERT, and RoBERTa, as well as the GPT-based models Alpaca and ChatGPT, which have not previously been evaluated on a country-level dialectal dataset. This paper evaluates and discusses the performance of these models and tools on a dialectal Arabic dataset, taking into consideration the base training data and model behavior. The experiments show that GPT-based models are still far from performing the dialect classification task effectively. BERT-based models showed overfitting, and the fine-tuned RoBERTa almost exceeded the task-specific CAMeLBERT before fine-tuning. According to these findings, no available dialectal dataset is of sufficient quality to generalize over the task.