Gaining insights in datasets in the shade of "garbage in, garbage out" rationale: Feature space distribution fitting

WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY (2022)

Abstract
This article emphasizes understanding the "Garbage In, Garbage Out" (GIGO) rationale and ensuring dataset quality in Machine Learning (ML) applications in order to achieve high and generalizable performance. An initial step should be added to the ML workflow in which researchers evaluate the insights gained from quantitative analysis of a dataset's sample and feature spaces. This study contributes toward that goal by suggesting a technique that quantifies datasets in terms of their feature frequency distribution characteristics, providing a unique insight into how frequently features occur in the available dataset samples. The technique was demonstrated on 11 benign and malign (malware) Android application datasets belonging to six academic Android mobile malware classification studies. The permissions requested by applications, such as CALL_PHONE, compose a relatively high-dimensional binary feature space. The results showed that the distributions fit well into two of the four long right-tail statistical distributions considered (log-normal, exponential, power law, and Poisson). Specifically, log-normal was the most frequently exhibited distribution; the exceptions were two malign datasets, which followed an exponential distribution. This study also explores an analysis of the features that fit or do not fit the statistical distribution, which enhances the insights into the feature space. Finally, the study compiles examples of phenomena in the literature that exhibit these statistical distributions, which should be considered when interpreting the fitted distributions. In conclusion, applying well-formed statistical methods provides a clear understanding of the datasets and of intra-class and inter-class differences before proceeding with feature selection and building a classifier model. Feature distribution characteristics should be among the properties analyzed beforehand. This article is categorized under: Technologies > Data Preprocessing; Technologies > Classification; Technologies > Machine Learning
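The idea of fitting feature frequencies to the four candidate distributions can be illustrated with a minimal sketch. The abstract does not describe the fitting procedure, so everything below is an assumption made for illustration: the function names (`feature_frequencies`, `log_likelihoods`, `best_fit`) are hypothetical, each distribution is fitted with its closed-form maximum-likelihood estimate, the power law is treated as a continuous Pareto with its lower bound fixed at the smallest observed count, and the candidates are ranked by raw log-likelihood. Comparing the Poisson probability mass function with continuous densities in this way is only a rough heuristic, not a formal goodness-of-fit test.

```python
import math


def feature_frequencies(samples):
    """Count, for each binary feature, how many samples have it set.

    `samples` is an iterable of sets, e.g. the permissions each app requests.
    """
    counts = {}
    for sample in samples:
        for feature in sample:
            counts[feature] = counts.get(feature, 0) + 1
    return counts


def log_likelihoods(values):
    """Maximised log-likelihood of each candidate distribution.

    `values` must be strictly positive (feature counts always are).
    All four fits use closed-form maximum-likelihood estimates.
    """
    n = len(values)
    logs = [math.log(v) for v in values]
    mean = sum(values) / n
    mu = sum(logs) / n
    var = sum((x - mu) ** 2 for x in logs) / n
    if var == 0:
        raise ValueError("all values are identical; nothing to fit")

    # Log-normal: mu and var are the MLEs of the log-scale parameters.
    ll_lognormal = -sum(logs) - 0.5 * n * math.log(2 * math.pi * var) - 0.5 * n

    # Exponential: MLE rate is 1 / mean, so rate * sum(values) equals n.
    ll_exponential = -n * math.log(mean) - n

    # Power law (continuous Pareto), lower bound assumed to be min(values).
    x_min = min(values)
    s = sum(logs) - n * math.log(x_min)
    alpha = 1 + n / s
    ll_power_law = n * math.log(alpha - 1) - n * math.log(x_min) - alpha * s

    # Poisson: MLE of the rate is the sample mean.
    ll_poisson = sum(
        v * math.log(mean) - mean - math.lgamma(v + 1) for v in values
    )

    return {
        "log-normal": ll_lognormal,
        "exponential": ll_exponential,
        "power law": ll_power_law,
        "Poisson": ll_poisson,
    }


def best_fit(values):
    """Name of the candidate with the highest log-likelihood."""
    scores = log_likelihoods(values)
    return max(scores, key=scores.get)
```

Under these assumptions, calling `best_fit(list(feature_frequencies(apps).values()))` on a collection `apps` of per-application permission sets would name the candidate that best describes how often permissions such as CALL_PHONE are requested. Running this separately for the benign and the malign class gives a crude view of the inter-class differences the abstract refers to.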
Keywords
binary classification, data preprocessing, data profiling, data quality, machine learning