NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts
arXiv (2024)
Abstract
Large language models (LLMs) have demonstrated a strong ability to generate code
for productive activities. However, current benchmarks for code synthesis, such
as HumanEval, MBPP, and DS-1000, are predominantly oriented towards
introductory tasks in algorithms and data science, and do not adequately capture
the challenging requirements prevalent in real-world coding. To fill this gap, we
propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror
the complexity and variety of scenarios in real coding tasks. NCB comprises 402
high-quality problems in Python and Java, meticulously selected from natural
user queries from online coding services, covering 6 different domains. Noting
the extraordinary difficulty of creating test cases for real-world queries,
we also introduce a semi-automated pipeline to enhance the efficiency of test
case construction. Compared with manual test case construction, it achieves a
more than fourfold increase in efficiency. Our systematic experiments on 39 LLMs find that
performance gaps on NCB between models with similar HumanEval scores can still
be significant, indicating a lack of focus on practical code synthesis
scenarios or over-optimization for HumanEval. On the other hand, even
the best-performing model, GPT-4, is still far from satisfactory on NCB. The evaluation
toolkit and development set are available at
https://github.com/THUDM/NaturalCodeBench.