Pars-OFF: A Benchmark for Offensive Language Detection on Farsi Social Media

IEEE Transactions on Affective Computing (2023)

Abstract
As social media use grows and users can share comments instantly, systems that identify offensive content have become a necessity in every language. Farsi has more than 110 million speakers, yet publicly available resources for offensive language identification in Farsi are lacking. To fill this gap, we present Pars-OFF, a corpus for offensive language detection in Farsi annotated at three layers. The corpus contains 10,563 samples. The tweets were collected with a combination of similarity-based and keyword-based data selection techniques to avoid severe class imbalance. As a baseline, this article also reports the performance of traditional machine learning approaches and Transformer-based models on the Pars-OFF dataset. The best performance was obtained by the BERT+fastText model, which yielded a macro-F1 score of 89.57.
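For readers unfamiliar with the reported metric, the following is a minimal sketch of how a macro-averaged F1 score is generally computed: the F1 of each class is calculated separately and the results are averaged without weighting by class size, which is why the metric is commonly used on imbalanced data. The function name `macro_f1`, the label names `OFF` and `NOT`, and the toy predictions are illustrative assumptions, not taken from the paper, and this is not the authors' evaluation code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores, in [0, 1]."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (assumed names, not the Pars-OFF annotation scheme)
y_true = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "NOT", "NOT", "OFF"]
score = macro_f1(y_true, y_pred)
```

The function returns a value between 0 and 1. A figure such as 89.57 presumably corresponds to this value expressed as a percentage.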
Keywords
Abusive language detection, Farsi language, Farsi social media, offensive language detection