A Semi-Supervised Framework of Clustering Selection for De-Duplication

2019 IEEE 35th International Conference on Data Engineering (ICDE), 2019

Abstract
We view data de-duplication as a clustering problem. Recently, [1] introduced a framework called restricted correlation clustering (RCC) to model de-duplication problems. Given a set X, an unknown target clustering C* of X, and a class F of clusterings of X, the goal is to find a clustering C in F that minimizes the correlation loss. The clustering algorithm is allowed to interact with a domain expert by asking whether a pair of records corresponds to the same entity. The main drawback of the algorithm developed by [1] is that its pre-processing step has a time complexity of Θ(|X|^2). In this paper, we make the following contributions. We develop a sampling procedure (based on locality-sensitive hashing) that requires only linear pre-processing time, O(|X|). We prove that our sampling procedure can estimate the correlation loss of all clusterings in F using only a small number of labelled examples. In fact, the number of labelled examples is independent of |X| and depends only on the complexity of the class F. Further, we show that, with high probability, our procedure makes a constant number of queries to the domain expert to sample one pair. We then perform an extensive empirical evaluation of our approach, which demonstrates the efficiency of our method.
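
To make the idea of estimating a clustering's loss from a few expert-labelled pairs concrete, here is a minimal sketch. It is not the paper's procedure: it swaps in plain uniform pair sampling for the locality-sensitive-hashing sampler, and it uses the unweighted fraction of disagreeing pairs as a stand-in for the correlation loss, whose exact definition the abstract does not give. The names `estimate_pair_disagreement`, `clustering`, and `same_entity` are hypothetical.

```python
import random
from typing import Callable, Dict, Hashable, Sequence


def estimate_pair_disagreement(
    records: Sequence[Hashable],
    clustering: Dict[Hashable, int],
    same_entity: Callable[[Hashable, Hashable], bool],
    num_pairs: int,
    seed: int = 0,
) -> float:
    """Estimate how often a clustering disagrees with the domain expert.

    Assumptions (not from the paper): pairs are drawn uniformly at random,
    and the loss is the plain fraction of sampled pairs on which the
    clustering and the expert disagree.
    """
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(num_pairs):
        # Draw two distinct records; each draw costs one expert query.
        a, b = rng.sample(list(records), 2)
        predicted_same = clustering[a] == clustering[b]
        if predicted_same != same_entity(a, b):
            disagreements += 1
    return disagreements / num_pairs
```

Under these assumptions the number of expert queries is `num_pairs`, which is chosen independently of the number of records. Uniform sampling rarely hits duplicate pairs in realistic data, which is one reason a more careful sampler is attractive.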
Keywords
Correlation,Clustering algorithms,Optimization,Task analysis,Loss measurement,Databases,Standards