Parallel SLINK for big data

Poonam Goyal, Sonal Kumari, Sumit Sharma, Sundar Balasubramaniam, Navneet Goyal

International Journal of Data Science and Analytics (2019)

Abstract

The major strength of hierarchical clustering algorithms is that they allow visual interpretation of clusters through dendrograms. Users can cut the dendrogram at different levels to obtain the desired number of clusters. A major problem with hierarchical algorithms is their quadratic runtime complexity, which limits the amount of data that can be clustered in a reasonable amount of time. Also, because of the agglomerative merging process, each iteration depends on the data of all previous iterations, which makes these algorithms difficult to parallelize. Thus, there is a need for an efficient parallel implementation of the SLINK algorithm that can scale to big data. We present a parallel SLINK algorithm, sGrid SLINK, for shared-memory architectures. sGrid SLINK produces exactly the same dendrogram as the classical SLINK algorithm. We also present hGrid SLINK, a parallel algorithm that fully exploits a multi-core cluster system. To the best of our knowledge, no hybrid parallel algorithm for SLINK is available in the literature. The proposed algorithms exploit the spatial locality of data to reduce the number of distance calculations. Adaptive gridding is used to counter skewness in the data and to ensure load balancing. Extensive experiments are carried out to establish the efficiency and scalability of the proposed parallel algorithms. sGrid SLINK is approximately 840 times faster than the state-of-the-art algorithm using 55 threads on a 48-core machine on a real dataset with 6 million data points. It also achieves a speedup of 47.93 over the best known sequential SLINK, Grid SLINK, on a real dataset using 48 threads on a 48-core machine. hGrid SLINK achieves a maximum speedup of 68.26 on a 32-node cluster (32 × 4 processing elements) with respect to Grid SLINK. The hGrid SLINK algorithm is able to cluster 200 million data points in only 1317 s (less than 22 min). No existing parallel SLINK algorithm is capable of such efficient clustering of big data.
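
For orientation, the classical sequential SLINK algorithm that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration of the textbook pointer-representation formulation of single-linkage clustering, not the sGrid SLINK or hGrid SLINK algorithms, whose gridding and parallelization details the abstract does not give. The function names `slink` and `num_clusters` and the Euclidean metric are assumptions made for the example. The nested loops over all earlier points show where the quadratic number of distance calculations comes from.

```python
import math

def slink(points):
    """Classical sequential SLINK (pointer representation), illustrative only.

    Returns (pi, lam): lam[j] is the height at which point j stops being
    the highest-indexed member of its cluster, and pi[j] is the
    highest-indexed member of the cluster it then belongs to.
    The names and the Euclidean metric are assumptions of this sketch.
    """
    n = len(points)
    pi = [0] * n
    lam = [math.inf] * n
    m = [0.0] * n  # scratch: distances from the new point to earlier ones
    for i in range(n):
        # Insert point i as a singleton that never merges (so far).
        pi[i] = i
        lam[i] = math.inf
        # One distance per earlier point: O(n^2) distances overall.
        for j in range(i):
            m[j] = math.dist(points[j], points[i])
        # Update merge heights in light of the new point.
        for j in range(i):
            if lam[j] >= m[j]:
                m[pi[j]] = min(m[pi[j]], lam[j])
                lam[j] = m[j]
                pi[j] = i
            else:
                m[pi[j]] = min(m[pi[j]], m[j])
        # Relabel pointers whose target now merges no later than they do.
        for j in range(i):
            if lam[j] >= lam[pi[j]]:
                pi[j] = i
    return pi, lam

def num_clusters(lam, height):
    """Number of clusters obtained by cutting the dendrogram at `height`."""
    return sum(1 for h in lam if h > height)
```

For example, calling `slink` on the one-dimensional points 0, 1, 3 and 7 gives merge heights 1, 2 and 4, and `num_clusters(lam, 1.5)` reports three clusters, which corresponds to cutting the dendrogram between the first and second merge.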

Keywords

Parallel clustering algorithms, Big data, SLINK, R-tree, Adaptive gridding
