A Practical And Effective Sampling Selection Strategy For Large Scale Deduplication
IEEE Transactions on Knowledge and Data Engineering (2016)
Abstract
Record deduplication aims at identifying entities that are potentially the same in a data repository. A set of pairs that is manually labeled is generally used to tune the deduplication process, as each dataset has a particular dirtiness pattern. However, producing an informative set of pairs is a very costly task, especially in very large datasets (even for expert users). We propose a new sampling strategy that is able to select a very small and informative set of pairs from large datasets. Our results show that our approach reduces user effort substantially while achieving a competitive or superior matching quality.
Keywords
Deduplication, signature-based deduplication