Aligning Comments to News Articles on a Budget.

IEEE Access (2023)

Abstract
Disagreement among text annotators during a human (expert) labeling process produces noisy labels, which degrade the performance of supervised learning algorithms for natural language processing. Using only high-agreement annotations introduces another challenge: the data imbalance problem. We study this challenge within the problem of relating user comments to the content of a news article. We show that traditional techniques for learning from imbalanced data, such as oversampling, using weighted loss functions, or assigning weak labels using crowdsourcing, may not be sufficient for modeling complex temporal relationships between news articles and user comments. In this study, we propose a framework for aligning comments and articles 1) from imbalanced news data characterized by 2) different degrees of annotator agreement, under 3) a constrained budget for human labeling and computing resources. Within the framework, we propose a Semi-Automatic Labeling solution based on Human-AI collaboration. We compare our proposed technique with traditional data imbalance handling techniques and synthetic data generation on the article-comment alignment problem, where the goal is to determine the category of an article-comment pair that represents how relevant the comment is to the article. Finding an effective and efficient solution is essential because manually labeling a sufficiently large number of article-comment pairs, based on a semantic understanding of an article and its comments, is time-consuming and prohibitively costly. We find that Human-AI collaboration outperforms all alternative techniques by 17% in article-comment alignment accuracy. When there is no time or budget for re-labeling some article-comment pairs, we find that synonym augmentation is a reasonable alternative. We also provide a detailed analysis of the effect of humans in the loop and the use of unlabeled data.
Keywords
Labeling, Annotations, Synthetic data, Noise measurement, Crowdsourcing, Collaboration, Classification, Annotators' disagreement, article-comment alignment, imbalance classes, multi-class classification