Overcoming the ordinal imbalanced data problem by combining data processing and stacked generalizations

Machine Learning with Applications (2022)

Abstract
Ordinal imbalanced datasets are pervasive in real world applications but remain challenging to analyse as they require specific methods to account for the ordering information and imbalanced classes. Failure to account for both those characteristics can substantially impact the model predictive performance. However, existing methods tend to focus either on ordinality or imbalance, rather than addressing both simultaneously. The few approaches that do account for both characteristics are not always easy to implement for non-advanced analysts and simpler approaches are needed to facilitate appropriate data processing. Here, we developed a general approach using some of the most popular machine learning algorithms to ensure appropriate processing of ordinal imbalanced datasets and to optimize the predictions of all classes. After transforming the multi-class ordinal problem into a well-known binary problem, we implemented several different resampling methods in a decision-tree classifier. We then used a stacked generalization algorithm to combine the classifiers to improve model predictive performance. To test our approach, we used two ordinal imbalanced datasets on student performance and wine quality. Individual resampling techniques tended to improve the accuracy of minority classes, while simultaneously increasing the number of false positives in those classes. This resulted in a decrease, sometimes substantial, in accuracy of other classes. The stacking model offered a good compromise between improvement in accuracy of minority classes and mitigation of reduced accuracy in other classes. Our approach provided useful insights into modelling strategies that should be favoured for implementation in production that involve these common datasets, depending on the end-user interests.
Keywords
Stacked generalizations, Machine learning, Ordinal data, Imbalanced data, Random forests, Resampling methods, Rare events, Classification
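
The pipeline described in the abstract (ordinal-to-binary transformation, resampling inside a decision-tree classifier, then stacking) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the binary transformation is a cumulative "is the label greater than threshold k?" decomposition, that random oversampling stands in for the several resampling methods the paper compares, and that logistic regression serves as the stacking meta-learner. The names `OversampledTree`, `fit_ordinal_stack` and `predict_ordinal` are hypothetical helpers introduced only for this illustration, and the toy data is synthetic rather than the student performance or wine quality datasets.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample


class OversampledTree(ClassifierMixin, BaseEstimator):
    """Hypothetical helper: a decision tree fitted after optional random
    oversampling of the rarest label (an assumed stand-in for the
    resampling methods mentioned in the abstract)."""

    def __init__(self, oversample=True, random_state=0):
        self.oversample = oversample
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        if self.oversample:
            labels, counts = np.unique(y, return_counts=True)
            extra = int(counts.max() - counts.min())
            if extra > 0:
                # Duplicate randomly chosen minority rows until classes match.
                minority_rows = np.flatnonzero(y == labels[np.argmin(counts)])
                idx = resample(minority_rows, replace=True, n_samples=extra,
                               random_state=self.random_state)
                X, y = np.vstack([X, X[idx]]), np.concatenate([y, y[idx]])
        self.tree_ = DecisionTreeClassifier(
            max_depth=3, random_state=self.random_state).fit(X, y)
        self.classes_ = self.tree_.classes_
        return self

    def predict_proba(self, X):
        return self.tree_.predict_proba(X)

    def predict(self, X):
        return self.tree_.predict(X)


def fit_ordinal_stack(X, y, random_state=0):
    """Fit one stacked binary model per threshold: is y greater than class k?"""
    X, y = np.asarray(X), np.asarray(y)
    classes = np.sort(np.unique(y))
    models = []
    for threshold in classes[:-1]:
        stack = StackingClassifier(
            estimators=[
                ("plain", OversampledTree(False, random_state)),
                ("oversampled", OversampledTree(True, random_state)),
            ],
            final_estimator=LogisticRegression(),  # assumed meta-learner
            cv=3,
        )
        models.append(stack.fit(X, (y > threshold).astype(int)))
    return classes, models


def predict_ordinal(classes, models, X):
    """Turn the P(y > k) estimates into class scores and take the argmax."""
    X = np.asarray(X)
    greater = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    upper = np.hstack([np.ones((len(X), 1)), greater])   # P(y > previous class)
    lower = np.hstack([greater, np.zeros((len(X), 1))])  # P(y > this class)
    return classes[np.argmax(upper - lower, axis=1)]


if __name__ == "__main__":
    # Synthetic ordinal data with imbalanced classes (illustration only).
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=300)
    y = np.digitize(x, [0.1, 0.8])  # labels 0, 1, 2; label 1 dominates
    classes, models = fit_ordinal_stack(x.reshape(-1, 1), y)
    pred = predict_ordinal(classes, models, x.reshape(-1, 1))
    for c in classes:
        print("class", c, "recall", np.mean(pred[y == c] == c))
```

Stacking a resampled and a non-resampled tree lets the meta-learner weigh the gain in minority-class accuracy against the extra false positives that resampling tends to introduce, which is the trade-off the abstract reports. Evaluating per-class accuracy on held-out data, rather than on the training data as in this toy run, would be needed for any real comparison.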