Top-k string similarity search with edit-distance constraints

Data Engineering (2013)

Abstract
String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper, we study the problem of top-k string similarity search with edit-distance constraints: given a collection of strings and a query string, return the k strings with the smallest edit distances to the query string. Existing methods usually try several edit-distance thresholds and select an appropriate one to find the top-k answers. However, selecting an appropriate threshold is rather expensive. To address this problem, we propose a progressive framework that improves the traditional dynamic-programming algorithm for computing edit distance. We prune unnecessary entries in the dynamic-programming matrix and compute only the pivotal entries. We extend our techniques to support top-k similarity search, and we develop a range-based method that groups the pivotal entries to avoid duplicated computations. Experimental results show that our method achieves high performance and significantly outperforms state-of-the-art approaches on real-world datasets.
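To make the problem statement concrete, the following is a minimal sketch of a naive baseline: it computes the full dynamic-programming edit distance for every string and keeps the k smallest. It is not the progressive, pivotal-entry method the paper proposes; the function names `edit_distance` and `top_k_naive` and the sample strings are assumptions chosen for illustration only.

```python
import heapq


def edit_distance(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance, keeping only two rows."""
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,           # delete cs
                            curr[j - 1] + 1,       # insert ct
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def top_k_naive(strings, query, k):
    """Return the k (distance, string) pairs with the smallest edit distance.

    Hypothetical baseline: every matrix entry is computed for every string,
    which is exactly the work a pruning-based method tries to avoid.
    """
    scored = ((edit_distance(s, query), s) for s in strings)
    return heapq.nsmallest(k, scored)


if __name__ == "__main__":
    data = ["srajit", "surajit", "suraj", "sarit"]
    print(top_k_naive(data, "surajit", 2))
```

This baseline needs no threshold, but it costs time proportional to the product of the string lengths for every candidate, which motivates pruning entries of the matrix that cannot affect the top-k answers.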
Keywords
string matching, information retrieval, dynamic programming matrix, range-based method, top-k string similarity search, edit-distance threshold, dynamic programming algorithm, dynamic programming, edit-distance constraint, bioinformatics, top-k answer, data cleaning, query processing, query string, time complexity, indexes