A New Sparse Data Clustering Method Based on Frequent Items

Proceedings of the ACM on Management of Data (2023)

Abstract
Large, sparse categorical data is a natural way to represent complex data like sequences, trees, and graphs. Such data is prevalent in many applications; for example, Criteo released a terabyte-scale click log of 4 billion records with millions of dimensions. While most existing clustering algorithms such as k-Means work well on dense, numerical data, relatively few algorithms can cluster sets of sparse categorical features. In this paper, we propose a new method called k-FreqItems that performs scalable clustering over high-dimensional, sparse data. To make clustering results easily interpretable, k-FreqItems is built upon a novel sparse center representation called FreqItem, which selects a set of high-frequency, non-zero dimensions to represent each cluster. Unlike most existing clustering algorithms, which adopt Euclidean distance as the similarity measure, k-FreqItems uses the popular Jaccard distance for comparing sets. Since the efficiency and effectiveness of k-FreqItems depend heavily on an initial set of representative seeds, we introduce a new randomized initialization method, SILK, to deal with the seeding problem of k-FreqItems. SILK uses locality-sensitive hash (LSH) functions for oversampling and identifies frequently co-occurring data in LSH buckets to determine a set of promising seeds, allowing k-FreqItems to converge swiftly in an iterative process. Experimental results over seven real-world sparse data sets show that SILK seeding is around 1.1–3.2× faster yet more effective than state-of-the-art seeding methods. Notably, SILK scales up well to a billion data objects on a commodity machine with 4 GPUs. The code is available at https://github.com/HuangQiang/k-FreqItems.
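The Python sketch below illustrates the core idea described in the abstract: a k-Means-style iteration that assigns sparse objects (sets of non-zero dimensions) to centers under Jaccard distance and then rebuilds each center as a FreqItem, i.e., the dimensions that occur frequently among the cluster's members. It is not the authors' implementation; the frequency threshold `min_freq`, the fallback rule for empty centers, and the random seeding (a simple stand-in for SILK) are assumptions made for illustration.

```python
# Minimal illustrative sketch of a k-FreqItems-style iteration (assumptions noted above).
import random
from collections import Counter

def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance between two sets of non-zero dimensions."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return 1.0 - inter / union

def freqitem_center(points: list, min_freq: float = 0.5) -> set:
    """FreqItem-style center: keep dimensions appearing in at least
    `min_freq` of the assigned points (threshold value is an assumption)."""
    counts = Counter(dim for p in points for dim in p)
    cutoff = min_freq * len(points)
    center = {dim for dim, c in counts.items() if c >= cutoff}
    # Fall back to the single most frequent dimension if the threshold empties the center.
    return center or {counts.most_common(1)[0][0]}

def kfreqitems_step(data: list, centers: list):
    """One assignment + center-update step under Jaccard distance."""
    labels = [min(range(len(centers)),
                  key=lambda j: jaccard_distance(x, centers[j]))
              for x in data]
    new_centers = []
    for j, old in enumerate(centers):
        members = [x for x, lbl in zip(data, labels) if lbl == j]
        new_centers.append(freqitem_center(members) if members else old)
    return new_centers, labels

if __name__ == "__main__":
    random.seed(1)
    # Toy sparse data: each object is a small set of categorical dimensions.
    data = [set(random.sample(range(20), 4)) for _ in range(200)]
    centers = random.sample(data, 3)  # random seeds stand in for SILK seeding
    for _ in range(5):
        centers, labels = kfreqitems_step(data, centers)
    print(centers)
```

In the paper, the random seeding used here would be replaced by SILK, which oversamples candidate seeds via LSH and keeps data that frequently co-occur in the same buckets; the sketch only shows the iterative FreqItem/Jaccard core.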
Keywords
sparse