Scaling Record Linkage To Non-Uniform Distributed Class Sizes

PAKDD'08: Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining (2008)

Abstract
Record linkage is a central task when integrating information from different sources. Record linkage models use so-called blockers to reduce the search space by discarding obviously different record pairs. In practice, important problems have Zipf-distributed class sizes, including some large classes for which blocking is no longer applicable. We therefore propose two novel meta-algorithms for scaling arbitrary record linkage models to such data sets. The first parallelizes a problem by creating overlapping subproblems; the second effectively reduces the search space for large classes. Our evaluation shows that both techniques are effective and can scale state-of-the-art models to challenging data sets.
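To make the blocking idea concrete, the following is a minimal sketch of generic key-based blocking, not the meta-algorithms proposed in the paper. The function `block_pairs`, the `first_token` key, and the toy records are all hypothetical and chosen only for illustration. Records that share a blocking key form a block, and only pairs within a block are kept as candidates.

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(records, key):
    """Group records by a blocking key and return candidate index pairs per block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = []
    for members in blocks.values():
        # A block with n members contributes n * (n - 1) / 2 candidate pairs.
        pairs.extend(combinations(members, 2))
    return pairs


def first_token(record):
    """Hypothetical blocking key: the first whitespace-separated token."""
    return record.split()[0]


# Hypothetical toy data: one comparatively large class and two small ones.
records = [
    "smith j", "smith john", "smith jon", "smith j.",
    "meyer a", "meyer anna",
    "zhou l",
]

candidates = block_pairs(records, first_token)
all_pairs = len(records) * (len(records) - 1) // 2
print(len(candidates), "candidate pairs out of", all_pairs)
```

On this toy input, blocking keeps 7 of the 21 possible pairs, and 6 of those 7 come from the single largest block. Because the number of pairs grows quadratically with block size, a few very large classes can dominate the remaining search space, which is the situation the abstract describes.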
Keywords
large class, search space, arbitrary record linkage model, different record pair, record linkage, record linkage model, different source, scaling technique, central task, class size