How to select samples for active learning? Document clustering with active learning methodology

2023 27th International Conference on Engineering of Complex Computer Systems (ICECCS) (2023)

Abstract
In this paper, we investigate the applicability of Active Learning to text clustering and topic modeling. Both tasks are often non-trivial because the notion of text similarity is ambiguous. In our experiments, we implemented the Active Learning approach by simulating the annotator: labels were drawn automatically from datasets with prepared labels. In a simulated study conducted on Polish and English datasets, we show how labeling a relatively small, carefully selected set of examples can improve clustering quality relative to approaches based on a general notion of text similarity. We compare several techniques for selecting samples for labeling, for dimensionality reduction, and for training, in order to obtain the best quality of the resulting clusters with the minimum number of annotations. The results show that a relatively simple approach can produce good-quality clusters and can thus support the development of classification ontologies in a data-centric way.
Keywords
active learning, document clustering, natural language processing
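
The simulated setup described in the abstract, where an automatic annotator answers queries from prepared labels, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say which selection techniques were compared, so the sketch uses margin-based uncertainty sampling with a nearest-centroid model, and random 2D points stand in for document embeddings. The function names are hypothetical and this is not the authors' implementation.

```python
import math
import random


def nearest_centroid_margin(point, centroids):
    """Return (predicted_label, margin), where margin is the gap between the
    distances to the closest and the second-closest centroid."""
    dists = sorted((math.dist(point, c), label) for label, c in centroids.items())
    return dists[0][1], dists[1][0] - dists[0][0]


def compute_centroids(points, labeled):
    """Average the labeled points of each class; `labeled` maps index -> label."""
    groups = {}
    for idx, label in labeled.items():
        groups.setdefault(label, []).append(points[idx])
    return {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in groups.items()
    }


def simulate_active_learning(points, oracle_labels, budget):
    """Query `budget` labels by smallest margin, starting from one seed per class.

    `oracle_labels` plays the role of the automatic annotator: a query simply
    reads the prepared label for the selected sample.
    """
    labeled = {}
    for idx, label in enumerate(oracle_labels):
        if label not in labeled.values():
            labeled[idx] = label  # first example of each class is the seed
    for _ in range(budget):
        centroids = compute_centroids(points, labeled)
        candidates = [i for i in range(len(points)) if i not in labeled]
        if not candidates:
            break
        # Assumed selection rule: the sample the current model is least sure about.
        query = min(
            candidates,
            key=lambda i: nearest_centroid_margin(points[i], centroids)[1],
        )
        labeled[query] = oracle_labels[query]
    centroids = compute_centroids(points, labeled)
    predictions = [nearest_centroid_margin(p, centroids)[0] for p in points]
    accuracy = sum(p == t for p, t in zip(predictions, oracle_labels)) / len(points)
    return labeled, accuracy


# Toy data: three Gaussian blobs standing in for three topics (illustrative only).
rng = random.Random(0)
centers = {0: (0.0, 0.0), 1: (5.0, 5.0), 2: (0.0, 5.0)}
points, oracle_labels = [], []
for label, (cx, cy) in centers.items():
    for _ in range(30):
        points.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
        oracle_labels.append(label)

labeled, accuracy = simulate_active_learning(points, oracle_labels, budget=10)
print(f"labels used: {len(labeled)}, agreement with reference labels: {accuracy:.2f}")
```

Replacing the `min(...)` selection rule with a random choice over `candidates` gives a simple baseline against which a selection strategy can be compared at the same annotation budget.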