MassJoin: A MapReduce-based method for scalable string similarity joins
ICDE (2014)
Abstract
String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study scalable string similarity joins using MapReduce. We propose a MapReduce-based framework, called MASSJOIN, which supports both set-based and character-based similarity functions. We extend the existing partition-based signature scheme to support set-based similarity functions, and we use the signatures to generate key-value pairs. To reduce the transmission cost, we merge key-value pairs, which reduces their number from cubic to linear complexity without sacrificing pruning power. To improve the performance, we incorporate “light-weight” filter units into the key-value pairs; these can be used to prune a large number of dissimilar pairs without significantly increasing the transmission cost. Experimental results on real-world datasets show that our method significantly outperforms state-of-the-art approaches.
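The general idea of using partition-based signatures as join keys can be illustrated with a small sketch. It assumes edit distance as the character-based similarity function and relies on the pigeonhole principle: if a string is split into tau + 1 segments, any string within edit distance tau must contain at least one of those segments as a substring. The code runs in a single process and only mimics the map (emit keys) and reduce (group by key, verify) steps. It is not the MASSJOIN implementation: it does not merge key-value pairs or use filter units, and all function names (`segments`, `substrings`, `similarity_join`) are hypothetical.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Classic dynamic-programming edit distance (Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def segments(s, tau):
    """Split s into tau + 1 disjoint segments of near-equal length."""
    n = tau + 1
    base, extra = divmod(len(s), n)
    out, pos = [], 0
    for i in range(n):
        length = base + (1 if i >= n - extra else 0)  # last `extra` segments are longer
        out.append(s[pos:pos + length])
        pos += length
    return out


def substrings(s):
    """All substrings of s, including the empty string (naive on purpose)."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def similarity_join(strings, tau):
    """Self-join: return index pairs (i, j), i < j, with edit distance <= tau."""
    # "Map" step: each string emits its segments and its substrings as keys.
    groups = defaultdict(lambda: (set(), set()))
    for sid, s in enumerate(strings):
        for seg in segments(s, tau):
            groups[seg][0].add(sid)
        for sub in substrings(s):
            groups[sub][1].add(sid)

    # "Reduce" step: strings sharing a key become candidate pairs.
    candidates = set()
    for seg_ids, sub_ids in groups.values():
        for r in seg_ids:
            for s in sub_ids:
                if r != s:
                    candidates.add((min(r, s), max(r, s)))

    # Verification: keep only pairs that really satisfy the threshold.
    return sorted(p for p in candidates
                  if edit_distance(strings[p[0]], strings[p[1]]) <= tau)


if __name__ == "__main__":
    data = ["vldb", "pvldb", "sigmod", "sigmmod", "icde"]
    print(similarity_join(data, tau=1))
```

Emitting every substring as a key is deliberately naive and produces many key-value pairs; reducing that volume is the cost the abstract says MASSJOIN addresses by merging key-value pairs.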
Keywords
mapreduce-based method,transmission cost reduction,scalable algorithm,massjoin,string matching,big data,linear complexity,character-based similarity functions,computational complexity,mapreduce-based framework,large-scale string similarity join,cubic complexity,light-weight filter units,scalable string similarity joins,set-based similarity functions,data integration,cost reduction,key-value pairs,partition-based signature scheme,erbium,open systems,filtering