AAPFE: Aligned Assembly Pre-Training Function Embedding for Malware Analysis

Hairen Gui,Ke Tang,Zheng Shan,Meng Qiao,Chunyan Zhang,Yizhao Huang,Fudong Liu

ELECTRONICS（2022）

引用 0|浏览3

暂无评分

摘要

The use of natural language processing to analyze binary data is a popular research topic in malware analysis. Embedding binary code into a vector is an important basis for building a binary analysis neural network model. Current solutions focus on embedding instructions or basic block sequences into vectors with recurrent neural network models or utilizing a graph algorithm on control flow graphs or annotated control flow graphs to generate binary representation vectors. In malware analysis, most of these studies only focus on the single structural information of the binary and rely on one corpus. It is difficult for vectors to effectively represent the semantics and functionality of binary code. Therefore, this study proposes aligned assembly pre-training function embedding, a function embedding scheme based on a pre-training aligned assembly. The scheme creatively applies data augmentation and a triplet network structure to the embedding model training. Each sub-network extracts instruction sequence information using the self-attention mechanism and basic block graph structure information with the graph convolution network model. An embedding model is pre-trained with the produced aligned assembly triplet function dataset and is subsequently evaluated against a series of comparative experiments and application evaluations. The results show that the model is superior to the state-of-the-art methods in terms of precision, precision ranking at top N (p@N), and the area under the curve, verifying the effectiveness of the aligned assembly pre-training and multi-level information extraction methods.

查看译文

关键词

malware analysis, function embedding, aligned assembly, self-attention, graph convolution network

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要