Phishing Website Detection Based on Hybrid Resampling KMeansSMOTENCR and Cost-Sensitive Classification

Jaya Srivastava, Aditi Sharan

Advances in Cognitive Science and Communications (2023)

Abstract
In many real-world scenarios, such as fraud detection and phishing website classification, training datasets typically have a skewed class distribution, with majority-class samples (e.g., legitimate websites) overwhelming minority-class samples (e.g., phishing websites). Most machine learning algorithms assume balanced class distributions and are therefore biased towards the majority (uninteresting) class, ignoring the minority (interesting) class(es). To handle class imbalance, researchers have proposed solutions at both (i) the data level and (ii) the algorithm level. In this study we propose a dual approach that handles class imbalance in phishing website classification at both the data level and the algorithm level. We propose a novel hybrid resampling approach, KMeansSMOTENCR, which balances the dataset by first oversampling the minority class using the KMeans Synthetic Minority Oversampling Technique (KMeansSMOTE) (Douzas et al. in Inf Sci 465:1–20, 2018 [1]) and then applying the Neighborhood Cleaning Rule (NCR) (Laurikkala in AIME, LNAI 2001. Springer, Berlin, pp 63–66, 2001 [2]) undersampling technique as a data cleaning step, which addresses the possibility of synthetic minority-class samples invading the majority-class region. Finally, we employ Cost-Sensitive Random Forest (CS-RF), Cost-Sensitive Extreme Gradient Boosting (CS-XGB), Cost-Sensitive Support Vector Machine (CS-SVM), and Cost-Sensitive Logistic Regression (CS-LR) classifiers as the algorithm-level balancing approach. We evaluate the performance of the CS-RF, CS-XGB, CS-SVM, and CS-LR classifiers on the (i) Original (Imbalanced), (ii) NCR (Balanced), (iii) KMeansSMOTE (Balanced), and (iv) KMeansSMOTENCR (Balanced) datasets. In Sect. 4, Result and Discussion, we demonstrate that the highest ROC_AUC, F1, and GMean are obtained with our proposed method, which outperforms the other three.
To the best of our knowledge, our novel hybrid resampling approach, KMeansSMOTENCR, has not been published in any existing study.
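To make the algorithm-level side and two of the reported metrics concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that class costs are set by the common inverse-frequency ("balanced") heuristic, which the abstract does not specify, and it computes F1 and GMean for a binary problem from predicted labels. The function names `balanced_class_weights` and `f1_and_gmean` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumption: this heuristic stands in for whatever misclassification
    costs the paper actually uses; the abstract does not state them.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def f1_and_gmean(y_true, y_pred, positive=1):
    """Return (F1, GMean) for a binary task; `positive` marks the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0

    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    gmean = math.sqrt(recall * specificity)            # geometric mean of both rates
    return f1, gmean


if __name__ == "__main__":
    # Toy labels: 90 legitimate (0) and 10 phishing (1) samples.
    print(balanced_class_weights([0] * 90 + [1] * 10))
    print(f1_and_gmean([1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                       [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]))
```

A weight dictionary of this shape is the kind of per-class cost that many classifiers accept as a class-weight setting. GMean is low whenever either class is poorly recognised, which is why it is often reported alongside F1 on imbalanced data.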
Keywords
website detection, hybrid resampling kmeanssmotencr, classification, cost-sensitive