Single-pass and linear-time k-means clustering based on MapReduce.

Inf. Syst. (2016)

Abstract
In recent years, k-means has been fitted into the MapReduce framework and has thereby become a very effective solution for clustering very large datasets. However, k-means is not inherently suitable for execution in MapReduce. The iterative nature of k-means cannot be modeled in MapReduce, so each iteration of k-means must be executed as an independent MapReduce job. This results in high I/O overhead, because in each iteration the whole dataset must be read from and written to slow disks. We have proposed a single-pass solution based on MapReduce, called mrk-means, which uses the reclustering technique. In contrast to available MapReduce-based k-means implementations, mrk-means reads the dataset only once and hence is several times faster. The time complexity of mrk-means is linear, which is lower than that of iterative k-means. Because it uses the k-means++ seeding algorithm, mrk-means also produces clusters of higher quality. Theoretically, the results of mrk-means are O(log² k)-competitive with the optimal clustering in the worst case, where k is the number of clusters. In our experiments, which were run on a cluster of 40 machines running the Hadoop framework, mrk-means showed both faster execution times and higher-quality clustering results than available MapReduce-based and stream-based k-means variants.

Highlights
mrk-means is a novel clustering algorithm based on MapReduce.
mrk-means is single-pass and linear-time.
mrk-means results in clusters that are O(log² k)-competitive with the optimal solution.
mrk-means is both faster and more accurate than Apache Mahout and GraphLab k-means.
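The single-pass reclustering idea can be illustrated with a small in-memory sketch. It is not the authors' implementation: the abstract does not specify how mrk-means splits work between mappers and reducers, so the sketch assumes that each simulated mapper clusters its own chunk once (k-means++ seeding plus a few Lloyd iterations) and emits weighted centroids, and that a single simulated reducer reclusters those centroids. All function names (`kmeanspp_seed`, `weighted_kmeans`, `mrk_means_sketch`) and parameters (`n_splits`, `iters`) are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, weights, k, rng):
    """k-means++ seeding: pick each new center with probability
    proportional to weight * squared distance to the nearest chosen center."""
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        probs = [w * d for w, d in zip(weights, d2)]
        if sum(probs) > 0:
            nxt = rng.choices(points, weights=probs, k=1)[0]
        else:
            # Every point with positive weight already coincides with a center.
            nxt = rng.choice(points)
        centers.append(nxt)
        d2 = [min(d, sq_dist(p, nxt)) for p, d in zip(points, d2)]
    return centers


def weighted_kmeans(points, weights, k, rng, iters=10):
    """Weighted Lloyd iterations started from k-means++ seeds.
    Returns (centers, weight assigned to each center in the last pass)."""
    centers = kmeanspp_seed(points, weights, k, rng)
    dim = len(points[0])
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # An empty cluster keeps its previous center.
        centers = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centers[j]
            for j in range(k)
        ]
    return centers, totals


def mrk_means_sketch(data, k, n_splits=4, seed=0):
    """Assumed structure: every chunk is read once by a 'mapper', and the
    'reducer' only sees the small set of weighted intermediate centroids."""
    rng = random.Random(seed)
    chunks = [data[i::n_splits] for i in range(n_splits)]

    inter_points, inter_weights = [], []
    for chunk in chunks:  # map phase (would run in parallel on a cluster)
        centers, totals = weighted_kmeans(chunk, [1.0] * len(chunk), k, rng)
        inter_points.extend(centers)
        inter_weights.extend(totals)

    # Reduce phase: recluster the n_splits * k weighted centroids.
    return weighted_kmeans(inter_points, inter_weights, k, rng)
```

In this sketch the raw data is touched only in the map phase, and the reducer works on `n_splits * k` points, which is the property the abstract attributes to a single-pass approach. Weighting each intermediate centroid by the number of points it represents keeps the final centroids from being skewed by small clusters.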
Keywords
Distributed k-means, Data clustering, MapReduce-based clustering