A Communication Efficient Parallel DBSCAN Algorithm based on Parameter Server.
CIKM(2017)
摘要
Recent benchmark studies show that MPI-based distributed implementations of DBSCAN, e.g., PDSDBSCAN, outperform other implementations such as apache Spark etc. However, the communication cost of MPI DBSCAN increases drastically with the number of processors, which makes it inefficient for large scale problems.
In this paper, we propose PS-DBSCAN, a parallel DBSCAN algorithm that combines the disjoint-set data structure and Parameter Server framework, to minimize communication cost. Since data points within the same cluster may be distributed over different workers which result in several disjoint-sets, merging them incurs large communication costs. In our algorithm, we employ a fast global union approach to union the disjoint-sets to alleviate the communication burden. Experiments over the datasets of different scales demonstrate that PS-DBSCAN outperforms the PDSDBSCAN with 2-10 times speedup on communication efficiency. We have released our PS-DBSCAN in an algorithm platform called Platform of AI (PAI) in Alibaba Cloud.
更多查看译文
关键词
Density-Based clustering, Parallel DBSCAN, Parameter Server
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络