Parallelizing String Similarity Join Algorithms

Ling-Chih Yao, Lipyeow Lim

DATABASES THEORY AND APPLICATIONS, ADC 2018 (2018)

Abstract
A key operation in data cleaning and integration is the use of string similarity join (SSJ) algorithms to identify and remove duplicates or similar records within data sets. With the advent of big data, a natural question is how to parallelize SSJ algorithms. There is a large body of existing work on SSJ algorithms and parallelizing each one of them may not be the most feasible solution. In this paper, we propose a parallelization framework for string similarity joins that utilizes existing SSJ algorithms. Our framework partitions the data using a variety of partitioning strategies and then executes the SSJ algorithms on the partitions in parallel. Some of the partitioning strategies that we investigate trade accuracy for speed. We implemented and validated our framework on several SSJ algorithms and data sets. Our experiments show that our framework results in significant speedup with little loss in accuracy.
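The partition-then-join idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `naive_ssj` is a stand-in for any existing SSJ algorithm, and `partition_by_token_count`, `bucket_width`, and the token-based Jaccard measure are hypothetical choices made only for this example. The partitioning shown is lossy, since similar pairs that land in different partitions are never compared, which is one way a strategy might trade accuracy for speed.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def naive_ssj(records, threshold):
    """Stand-in for an existing SSJ algorithm: compare all pairs.

    `records` is a list of (id, string) tuples; returns a set of id pairs.
    """
    return {
        (i, j)
        for (i, s), (j, t) in combinations(records, 2)
        if jaccard(s, t) >= threshold
    }


def partition_by_token_count(records, bucket_width):
    """Hypothetical strategy: group records by token-count bucket.

    Records in different buckets are never compared, so some similar
    pairs can be missed (accuracy traded for speed).
    """
    parts = defaultdict(list)
    for rid, s in records:
        parts[len(s.split()) // bucket_width].append((rid, s))
    return list(parts.values())


def parallel_ssj(records, threshold, ssj=naive_ssj, bucket_width=3, workers=4):
    """Run `ssj` independently on each partition and union the results.

    A thread pool keeps the sketch portable; for CPU-bound joins a
    process pool would be the more likely choice in practice.
    """
    parts = partition_by_token_count(records, bucket_width)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda part: ssj(part, threshold), parts)
        return set().union(*results)
```

Because the SSJ algorithm is passed in as the `ssj` argument, any existing join routine with the same signature could be substituted without changing the partitioning or the parallel driver. Comparing the output of `parallel_ssj` with that of the unpartitioned `ssj` on the same input gives a simple measure of how many pairs a given partitioning strategy loses.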
Keywords
String Similarity, Parallelization Framework, Data Partitioning Strategies, Maximum Item Size, Similar Record Pairs