Towards Summarizing Code Snippets Using Pre-Trained Transformers
CoRR (2024)
Abstract
When comprehending code, developers can lean on the natural language
comments that document it, but such comments are not always there. To support
developers in such a scenario, several techniques have been presented to
automatically generate natural language summaries for a given piece of code. Most recent
approaches exploit deep learning (DL) to automatically document classes or
functions, while little effort has been devoted to more fine-grained
documentation (e.g., documenting code snippets or even a single statement).
Such a design choice is dictated by the availability of training data: for
example, in the case of Java, it is easy to create datasets composed of pairs,
each made of a method and its Javadoc comment, that can be fed to DL models to
teach them how to summarize a method. Such a comment-to-code linking is instead
non-trivial when it comes to
inner comments documenting a few statements. In this work, we take all the
steps needed to train a DL model to document code snippets. First, we manually
built a dataset featuring 6.6k comments that have been (i) classified based on
their type (e.g., code summary, TODO), and (ii) linked to the code statements
they document. Second, we used such a dataset to train a multi-task DL model,
taking as input a comment and being able to (i) classify whether it represents
a "code summary" or not and (ii) link it to the code statements it documents.
Our model identifies code summaries with 84% accuracy and links them
to the documented lines of code with recall and precision higher than 80%.
Third, we ran this model on 10k projects, identifying and linking code
summaries to the documented code. This unlocked the possibility of building a
large-scale dataset of documented code snippets that have then been used to
train a new DL model able to document code snippets. A comparison with
state-of-the-art baselines shows the superiority of the proposed approach.
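
To make the data-availability argument concrete, the sketch below shows why method-level pairs are easy to mine while inner comments are not. It is only an illustration under stated assumptions, not the authors' pipeline: the regular expression, the helper names (`mine_method_pairs`, `clean_javadoc`), and the `EXAMPLE` class are all hypothetical, and a real extractor would rely on a Java parser rather than a regex.

```python
import re

# Hypothetical, simplified pattern: a Javadoc block immediately followed by a
# method signature. Illustrative only; real Java needs a proper parser.
JAVADOC_METHOD = re.compile(
    r"/\*\*(?P<doc>.*?)\*/\s*"                                        # Javadoc block
    r"(?P<sig>(?:public|protected|private)[^;{]*?\([^)]*\)[^;{]*)\{",  # signature up to '{'
    re.DOTALL,
)


def clean_javadoc(raw):
    """Strip the leading '*' markers and collapse the text onto one line."""
    lines = [line.strip().lstrip("*").strip() for line in raw.splitlines()]
    return " ".join(line for line in lines if line)


def mine_method_pairs(java_source):
    """Return (signature, javadoc) pairs; the method boundary gives the link for free."""
    return [
        (" ".join(m.group("sig").split()), clean_javadoc(m.group("doc")))
        for m in JAVADOC_METHOD.finditer(java_source)
    ]


EXAMPLE = '''
class Stats {
    /**
     * Returns the mean of the given values.
     */
    public double mean(double[] xs) {
        // sum all values
        double s = 0;
        for (double x : xs) { s += x; }
        return s / xs.length;
    }
}
'''

pairs = mine_method_pairs(EXAMPLE)

# Inner comments are just as easy to find, but nothing in the syntax says
# which of the following statements they document.
inner_comments = re.findall(r"^[ \t]*//[ \t]*(.*)$", EXAMPLE, re.MULTILINE)

print(pairs)
print(inner_comments)
```

The Javadoc comment is tied to its method by position alone, whereas the inner comment `sum all values` could cover one, two, or three of the statements that follow it; resolving that scope is the linking problem the abstract describes.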