HHLD: Hateful posts Identification in Hindi Language leveraging multi task learning

IEEE Access(2023)

引用 0|浏览1
暂无评分
摘要
Hate speech is now a frequent occurrence on social media. Recently, the majority of study was devoted to identifying hate speech in languages with abundant resources (e.g., English). However, relatively few works are developed for languages with limited resources (e.g., Hindi, the third most widely used language on earth). In this study, Hindi Hate Language Dataset (HHLD) is created following a novel hierarchical fine-grained four-layer annotation approach. The top layer separates the posts into hateful and non-hateful categories. The second layer further categorises hateful posts into explicit hate and implicit hate. The third layer is the multilabel tagging of the post into topics, such as political, religious, racism, or sexism. The fourth layer involves the identification of the targeted named entity, either explicitly or implicitly. Additionally, a thorough evaluation of the data annotation schema for trustworthy annotation is provided. The HHLD data is the largest multi-layer annotated corpora in Hindi compared with the existing multi-layer annotated data. Experiments on the dataset using the transformer-based approaches in single-task learning (STL) attain encouraging performances in accuracy and weighted-f1 score. The experiment leveraged multitask learning (MTL) by including multiple related hate speech detection tasks from high-resource English and languages from the same linguistic family such as Urdu and Bangla with a transformer encoder as the shared layers to obtain a significant increment of 5.31% and 5.35% over STL in accuracy and weighted-f1 for layer A, 8.20%, and 22.83% for layer B. The MTL surpasses STL by 8.58% and 4.07% in exact match and hamming loss for layer C.
更多
查看译文
关键词
hateful posts identification,hindi language,learning
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要