Roman Urdu toxic comment classification

Language Resources and Evaluation(2021)

引用 8|浏览19
暂无评分
摘要
With the increasing popularity of user-generated content on social media, the number of toxic texts is also on the rise. Such texts cause adverse effects on users and society at large, therefore, the identification of toxic comments is a growing need of the day. While toxic comment classification has been studied for resource-rich languages like English, no work has been done for Roman Urdu despite being a widely used language on social media in South Asia. This paper addresses the challenge of Roman Urdu toxic comment detection by developing a first-ever large labeled corpus of toxic and non-toxic comments. The developed corpus, called RUT (Roman Urdu Toxic), contains over 72 thousand comments collected from popular social media platforms and has been labeled manually with a strong inter-annotator agreement. With this dataset, we train several classification models to detect Roman Urdu toxic comments, including classical machine learning models with the bag-of-words representation and some recent deep models based on word embeddings. Despite the success of the latter in classifying toxic comments in English, the absence of pre-trained word embeddings for Roman Urdu prompted to generate different word embeddings using Glove, Word2Vec and FastText techniques, and compare them with task-specific word embeddings learned inside the classification task. Finally, we propose an ensemble approach, reaching our best F1-score of 86.35%, setting the first-ever benchmark for toxic comment classification in Roman Urdu.
更多
查看译文
关键词
Roman Urdu, Toxic comment classification, Deep learning, Roman Urdu toxic comments, Deep ensemble
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要