
A Chi-Square Dissimilarity Measure for Clustering Categorical Datasets

Communications in Computer and Information Science (2023)

Abstract
Large databases are now in widespread use, and with them the use of categorical data has grown enormously, particularly in new approaches for identifying the most relevant items. In this context, cluster analysis is a relevant approach for processing categorical data. However, many machine learning models proposed in the literature have difficulty interpreting categorical variables because of their high dimensionality and data overlap, which can lead to high computational cost or poor algorithm performance. For this reason, we propose an unsupervised method that uses the C-S (Chi-Square) dissimilarity measure to map data from a categorical space to a continuous Euclidean space, so that the k-means algorithm can be applied meaningfully with the squared Euclidean distance. The proposed method was compared with other state-of-the-art unsupervised learning techniques for categorical data: the k-modes, Mkm-nof, weighted dissimilarity (W-D), Mkm-ndm and structure-based clustering (SBC) algorithms. The comparison was evaluated using accuracy (AC), adjusted Rand index (ARI) and normalized mutual information (NMI). Our proposal outperforms the other clustering methods on these evaluation measures across the 9 datasets used. For example, the AC of our Kmeans (C-S) method over the whole set of datasets is 0.8090, compared with 0.7907 for SBC1, 0.7820 for SBC2, 0.6979 for k-modes, 0.6949 for W-D, 0.6906 for Mkm-nof and 0.7254 for Mkm-ndm, demonstrating superiority not only in AC but also in NMI and ARI. Computational time was also highly relevant to our proposal: the results show that, on 8 of the 9 datasets, the Kmeans (C-S) method takes less than half the time the other algorithms need to run their models.
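The abstract does not give the exact definition of the C-S measure, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes the chi-square distance used in correspondence analysis: each attribute is one-hot encoded and every indicator column is divided by the square root of its category's relative frequency, so that the squared Euclidean distance between two embedded rows equals a chi-square distance between the original rows. The helper names `chi_square_embedding` and `kmeans`, and the toy data, are hypothetical and not taken from the paper.

```python
import numpy as np

def chi_square_embedding(data):
    """Embed categorical rows so that squared Euclidean distance between
    embedded rows is a chi-square distance (assumed form, not the paper's)."""
    data = np.asarray(data, dtype=object)
    n_rows, n_cols = data.shape
    blocks = []
    for j in range(n_cols):
        column = data[:, j].astype(str)
        categories, codes = np.unique(column, return_inverse=True)
        one_hot = np.zeros((n_rows, categories.size))
        one_hot[np.arange(n_rows), codes.ravel()] = 1.0
        freq = one_hot.mean(axis=0)  # relative frequency of each category
        blocks.append(one_hot / np.sqrt(freq))  # rare categories weigh more
    return np.hstack(blocks)

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means with squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(k):
            members = points[labels == c]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy data: two clearly separated groups of categorical rows.
toy = [
    ["red", "small"], ["red", "small"], ["red", "small"],
    ["blue", "large"], ["blue", "large"], ["blue", "large"],
]
embedded = chi_square_embedding(toy)
labels, centers = kmeans(embedded, k=2)
print(labels)
```

Because the embedding is an ordinary real-valued matrix, cluster centers are well-defined means, which is what lets k-means be used in place of mode-based algorithms such as k-modes.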
Keywords
chi-square