Cluster Metric Sensitivity to Irrelevant Features
CoRR (2024)

Abstract
Clustering algorithms are used extensively in data analysis for data
exploration and discovery. Technological advancements lead to continual
growth of data in terms of volume, dimensionality and complexity. This provides
great opportunities in data analytics, as the data can be interrogated for many
different purposes. This, however, leads to challenges, such as identification of
relevant features for a given task. In supervised tasks, one can utilise a
number of methods to optimise the input features for the task objective (e.g.
classification accuracy). In unsupervised problems, such tools are not readily
available, in part due to an inability to quantify feature relevance in
unlabeled tasks. In this paper, we investigate the sensitivity of clustering
performance to noisy, uncorrelated variables iteratively added to baseline
datasets with well-defined clusters. We show how different types of irrelevant variables
can impact the outcome of a clustering result from k-means in different ways.
We observe a resilience to very high proportions of irrelevant features for
adjusted rand index (ARI) and normalised mutual information (NMI) when the
irrelevant features are Gaussian distributed. For uniformly distributed
irrelevant features, we notice the resilience of ARI and NMI is dependent on
the dimensionality of the data and exhibits tipping points between high scores
and near zero. Our results show that the Silhouette Coefficient and the
Davies-Bouldin score are the most sensitive to irrelevant added features
exhibiting large changes in score for comparably low proportions of irrelevant
features, regardless of underlying distribution or data scaling. As such, the
Silhouette Coefficient and the Davies-Bouldin score are good candidates for
optimising feature selection in unsupervised clustering tasks.
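The experiment described above can be sketched as follows: cluster a baseline dataset with k-means, iteratively append irrelevant Gaussian-distributed features, and track how each metric responds. This is a minimal illustration using scikit-learn; the dataset parameters and noise-feature counts are assumptions for demonstration, not the paper's actual setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score,
                             davies_bouldin_score)

rng = np.random.default_rng(0)

# Baseline dataset with well-defined clusters (illustrative parameters).
X, y = make_blobs(n_samples=300, centers=3, n_features=2,
                  cluster_std=0.5, random_state=0)

for n_noise in (0, 2, 10, 50):
    # Append n_noise uncorrelated Gaussian features to the baseline data.
    if n_noise:
        noise = rng.normal(size=(X.shape[0], n_noise))
        X_aug = np.hstack([X, noise])
    else:
        X_aug = X
    labels = KMeans(n_clusters=3, n_init=10,
                    random_state=0).fit_predict(X_aug)
    # External metrics (ARI, NMI) compare against the known labels;
    # internal metrics (Silhouette, Davies-Bouldin) use only the data.
    print(f"noise={n_noise:3d}  "
          f"ARI={adjusted_rand_score(y, labels):.2f}  "
          f"NMI={normalized_mutual_info_score(y, labels):.2f}  "
          f"Silhouette={silhouette_score(X_aug, labels):.2f}  "
          f"DB={davies_bouldin_score(X_aug, labels):.2f}")
```

On data like this, the internal metrics (Silhouette, Davies-Bouldin) typically degrade quickly as noise features are added, while ARI and NMI remain high for much larger noise proportions, consistent with the sensitivity pattern the abstract reports for Gaussian-distributed irrelevant features.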