Scaling Automatic Extraction of Pseudocode
arXiv (2024)
Abstract
Pseudocode in a scholarly paper provides a concise way to express the
algorithms implemented therein. Pseudocode can also be thought of as an
intermediary representation that helps bridge the gap between programming
languages and natural languages. Having access to a large collection of
pseudocode can provide various benefits, ranging from enhancing algorithmic
understanding and facilitating further algorithmic design to empowering NLP or
computer-vision-based models for tasks such as automated code generation and
optical character recognition (OCR). We have created a large pseudocode
collection by extracting nearly 320,000 pseudocode examples from arXiv papers.
This process involved scanning over 2.2 million scholarly papers, with 1,000
of them being manually inspected and labeled. Our approach combines an
extraction mechanism tailored to maximize coverage with a validation
mechanism based on random sampling to check its accuracy and reliability, given
the inherent heterogeneity of the collection. In addition, we offer insights
into common pseudocode structures, supported by clustering and statistical
analyses. Notably, these analyses indicate an exponential-like growth in the
usage of pseudocode, highlighting its increasing significance.
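
The abstract mentions validation by random sampling but does not spell out the procedure. The following is only a minimal sketch of how such a check might look, not the authors' method: it draws a reproducible random sample of item identifiers for manual labeling and turns the labeled counts into an accuracy estimate with a Wilson score interval. The function names, the choice of the Wilson interval, and the count of 950 correct labels are all assumptions made for illustration; only the sample size of 1,000 echoes the abstract.

```python
import math
import random


def draw_sample(item_ids, sample_size, seed=0):
    """Pick a reproducible random sample of items for manual inspection.

    Hypothetical helper: the abstract does not describe the sampling code.
    """
    rng = random.Random(seed)
    return rng.sample(list(item_ids), sample_size)


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Illustrative identifiers only; a real run would use actual paper IDs.
    sample = draw_sample(range(2_200_000), 1_000)
    # Hypothetical outcome of manual labeling: 950 of 1,000 judged correct.
    low, high = wilson_interval(correct=950, total=1_000)
    print(f"sampled {len(sample)} items; accuracy interval [{low:.3f}, {high:.3f}]")
```

Fixing the seed keeps the sample reproducible, so a second annotator can inspect the same items.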