Understanding and Mitigating Label Bias in Malware Classification: An Empirical Study.

QRS(2022)

引用 1|浏览25
暂无评分
摘要
Machine learning techniques are promising for malware classification, but there is a neglected problem of label bias in the annotation process which decreases the performance in practice. To understand the label bias problems and existing solutions, we conduct an empirical study based on two Portable Executable (PE) malware sample datasets (i.e., open-sourced BODMAS with 52,793 samples and a new collected MAIN dataset of 153,811 samples), and 67 anti-virus engines in VirusTotal. We first show the two ways of label bias problems, including chaotic naming rules and annotation inconsistency. Then we present the effects of two solutions (i.e., electing one reputable AV engine and aggregating multiple labels based on majority voting) and find they face the problems of feature preference and engine independence. Finally, we propose some recommendations for improvements and get a 7.79% increase in the F1 score (i.e., from 84.83% to 92.62%). The dataset will be open-source for further study.
更多
查看译文
关键词
Malware Classification,machine learning,annotation bias
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要