Pars-OFF: A Benchmark for Offensive Language Detection on Farsi Social Media
IEEE Transactions on Affective Computing (2023)
Abstract
As social media use grows and users can share comments instantly, systems that identify offensive content have become a necessity in all languages. Farsi has more than 110 million speakers, yet publicly available resources for offensive language identification in Farsi are lacking. To fill this gap, we present Pars-OFF, a three-layered annotated corpus for offensive language detection in Farsi. The corpus contains 10,563 data samples. The tweets were collected with a combination of similarity-based and keyword-based data selection techniques to avoid severe class imbalance. As a baseline, this article also reports the performance of traditional machine learning approaches and Transformer-based models on the Pars-OFF dataset. The best performance was obtained by the BERT+fastText model, with an F1-Macro score of 89.57.
Keywords
Abusive language detection, Farsi language, Farsi social media, offensive language detection
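
The abstract reports results as an F1-Macro score. As a reminder of what that metric measures, the following is a minimal sketch of macro-averaged F1: the per-class F1 scores are computed separately and then averaged with equal weight, so a rare class counts as much as a frequent one. The function name `macro_f1` and the label strings `"OFF"` and `"NOT"` are illustrative assumptions, not the label scheme or evaluation code of Pars-OFF.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, on a 0-1 scale."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only.
gold = ["OFF", "OFF", "NOT", "NOT"]
predicted = ["OFF", "NOT", "NOT", "NOT"]
score = macro_f1(gold, predicted)
```

The function returns a value between 0 and 1; the 89.57 in the abstract is presumably the same quantity expressed on a 0 to 100 scale.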