On the impact of multiple source code representations on software engineering tasks - An empirical study

Journal of Systems and Software (2024)

Abstract
Efficiently representing source code is crucial for various software engineering tasks such as code classification and clone detection. Existing approaches primarily use the Abstract Syntax Tree (AST), and only a few focus on semantic graphs such as the Control Flow Graph (CFG) and Program Dependence Graph (PDG), which contain information about source code that the AST does not. Although some works have tried to utilize multiple representations, they do not provide insights into the costs and benefits of doing so. The primary goal of this paper is to discuss the implications of utilizing multiple source code representations, specifically AST, CFG, and PDG. We modify an AST path-based approach to accept multiple representations as input to an attention-based model, which lets us measure the impact of additional representations (such as CFG and PDG) over AST alone. We evaluate our approach on three tasks: Method Naming, Program Classification, and Clone Detection. Our approach increases the performance on these tasks by 11% (F1), 15.7% (Accuracy), and 9.3% (F1), respectively, over the baseline. In addition to the effect on performance, we discuss the timing overheads incurred with multiple representations. We envision that this work can provide a base for researchers to explore and experiment with a variety of source code representations for software engineering tasks.
Key words
Source code representation, Abstract Syntax Tree, Control Flow Graph, Program Dependence Graph, Code embedding, Method naming