Learning Graph-based Code Representations for Source-level Functional Similarity Detection

ICSE(2023)

引用 1|浏览40
暂无评分
摘要
Detecting code functional similarity forms the basis of various software engineering tasks. However, the detection is challenging as functionally similar code fragments can be implemented differently, e.g., with irrelevant syntax. Recent studies incorporate program dependencies as semantics to identify syntactically different yet semantically similar programs, but they often focus only on local neighborhoods (e.g., one-hop dependencies), limiting the expressiveness of program semantics in modeling functionalities. In this paper, we present Tailor that explicitly exploits deep graph-structured code features for functional similarity detection. Given source-level programs, Tailor first represents them into code property graphs (CPGs) - which combine abstract syntax trees, control flow graphs, and data flow graphs - to collectively reason about program syntax and semantics. Then, Tailor learns representations of CPGs by applying a CPG-based neural network (CPGNN) to iteratively propagate information on them. It improves over prior work on code representation learning through a new graph neural network (GNN) tailored to CPG structures instead of the off-the-shelf GNNs used previously. We systematically evaluate Tailor on C and Java programs using two public benchmarks. Experimental results show that Tailor outperforms the state-of-the-art approaches, achieving 99.8% and 99.9% F-scores in code clone detection and 98.3% accuracy in source code classification.
更多
查看译文
关键词
functional similarity detection,code representations,graph-based,source-level
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要