Text Deduplication with Minimum Loss Ratio

Proceedings of the 2019 11th International Conference on Machine Learning and Computing (2019)

Abstract
Text deduplication is an important operation in text document analysis applications. Given a set of text documents, we often need to remove the documents whose similarity values are not less than a specified threshold. However, if the set of similar documents to be removed is too large, the remaining set may not be large enough for text analysis. In this paper, we consider the problem of how to balance the removed set against the remaining set of text documents. We try to eliminate as much duplicate information as possible while removing the minimum number of text documents. We propose a greedy algorithm for this problem based on the concept of a similarity graph, which represents the similarity relationships among a set of text documents. We also consider an incremental algorithm for dynamic settings. Experimental results on real news document datasets show the efficiency of the proposed algorithms.
Keywords
Text deduplication, minimum vertex cover, similarity graph
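
The abstract does not spell out the greedy algorithm, so the following is only a minimal sketch of one standard approach that fits the keywords: build a similarity graph whose edges link documents with similarity not less than the threshold, then repeatedly remove the highest-degree vertex until no edge remains, which yields a (not necessarily minimum) vertex cover. The function names `jaccard`, `similarity_graph`, and `greedy_removal`, the whitespace tokenization, and the choice of Jaccard similarity are all assumptions made for illustration, not details taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets (an assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_graph(docs, threshold):
    """Adjacency sets: documents i and j are linked when similarity >= threshold."""
    tokens = [set(d.lower().split()) for d in docs]
    graph = {i: set() for i in range(len(docs))}
    for i, j in combinations(range(len(docs)), 2):
        if jaccard(tokens[i], tokens[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def greedy_removal(graph):
    """Remove highest-degree vertices until no edge remains (a vertex cover)."""
    graph = {v: set(nbrs) for v, nbrs in graph.items()}  # work on a copy
    removed = []
    while True:
        # Ties on degree are broken toward the smallest document index.
        v = max(graph, key=lambda u: (len(graph[u]), -u), default=None)
        if v is None or not graph[v]:
            break  # empty graph, or no similar pair is left
        for u in graph.pop(v):
            graph[u].discard(v)
        removed.append(v)
    return removed


docs = ["a b c d", "a b c e", "a b c f", "x y z"]
graph = similarity_graph(docs, threshold=0.5)
removed = greedy_removal(graph)
kept = [i for i in range(len(docs)) if i not in removed]
```

In this sketch the documents left in `kept` contain no pair whose similarity reaches the threshold. When one document is similar to many others that are not similar to each other, the max-degree rule removes only that one document and keeps the rest, which matches the goal of removing as few documents as possible.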