CODEP: Grammatical Seq2Seq Model for General-Purpose Code Generation

PROCEEDINGS OF THE 32ND ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON SOFTWARE TESTING AND ANALYSIS, ISSTA 2023(2023)

引用 7|浏览48
暂无评分
摘要
General-purpose code generation aims to automatically convert the natural language description to code snippets in a general-purpose programming language (GPL) such as Python. In the process of code generation, it is essential to guarantee the generated code satisfies grammar constraints of GPL. However, existing sequence-to-sequence (Seq2Seq) approaches neglect grammar rules when generating GPL code. In this paper, we devise a pushdown automaton (PDA)-based methodology to make the first attempt to consider grammatical Seq2Seq models for general-purpose code generation, exploiting the principle that PL is a subset of PDA recognizable language and code accepted by PDA is grammatical. Specifically, we construct a PDA module and design an algorithm to constrain the generation of Seq2Seq models to ensure grammatical correctness. Guided by this methodology, we further propose CODEP, a code generation framework equipped with a PDA module, to integrate the deduction of PDA into deep learning. This framework leverages the state of PDA deduction (including state representation, state prediction task, and joint prediction with state) to assist models in learning PDA deduction. To comprehensively evaluate CODEP, we construct a PDA for Python and conduct extensive experiments on four public benchmark datasets. CODEP can employ existing sequence-based models as base models, and we show that it achieves 100% grammatical correctness percentage on these benchmark datasets. Consequently, CODEP relatively improves 17% CodeBLEU on CONALA, 8% EM on DJANGO, and 15% CodeBLEU on JUICE-10K compared to base models. Moreover, PDA module also achieves significant improvements on the pre-trained models.
更多
查看译文
关键词
Code generation,Code intelligence,Seq2Seq,PDA,PL
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要