Weak Labelling for File-level Source Code Classification

2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)(2023)

引用 0|浏览1
暂无评分
摘要
Software repository hosting services contain large amounts of open-source software, with GitHub hosting over 200 million repositories, from new to established ones. However, these repositories are not easy to find, calling for various attempts to classify their application domains automatically. However, most proposed approaches use artifacts, like README files, as a proxy for the project, losing the information in the source code and the interaction between files. Furthermore, they all focus on the project-level, ignoring the decomposition of software projects into components and modules.This work presents a weak labelling approach based on keyword extraction to annotate source files in a software project.Our findings suggest that using keywords to perform file-level annotations is an effective approach that can capture enough information from the source file so that new labels can be predicted.The long-term goal of our research is to classify source code files and use these annotations to identify semantic components in software projects. In addition, these annotations can be used for semantic reverse engineering, software reuse, and more. We plan to train machine learning models that use our proposed weak supervision to better annotate source files inside software projects.
更多
查看译文
关键词
software classification,software categories key-words,semantic reverse engineering
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要