Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations

NEUROSURGERY (2023)

Abstract
BACKGROUND AND OBJECTIVES: Interest in generative large language models (LLMs) has grown rapidly. Although ChatGPT (GPT-3.5), a general LLM, has shown near-passing performance on medical student board examinations, the performance of ChatGPT and its successor GPT-4 on specialized examinations, and the factors affecting their accuracy, remain unclear. This study aims to assess the performance of ChatGPT and GPT-4 on a 500-question mock neurosurgical written board examination.

METHODS: The Self-Assessment Neurosurgery Examinations (SANS) American Board of Neurological Surgery Self-Assessment Examination 1 was used to evaluate ChatGPT and GPT-4. Questions were in single best answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests were used to assess performance differences in relation to question characteristics.

RESULTS: ChatGPT (GPT-3.5) and GPT-4 achieved scores of 73.4% (95% CI: 69.3%-77.2%) and 83.4% (95% CI: 79.8%-86.5%), respectively, relative to the user average of 72.8% (95% CI: 68.6%-76.6%). Both LLMs exceeded last year's passing threshold of 69%. Although scores between ChatGPT and question bank users were equivalent (P = .963), GPT-4 outperformed both (both P < .001). GPT-4 correctly answered every question that ChatGPT answered correctly, plus 37.6% (50/133) of the questions that ChatGPT answered incorrectly. Among 12 question categories, GPT-4 significantly outperformed users in each but performed comparably with ChatGPT in 3 (functional, other general, and spine) and outperformed both users and ChatGPT for tumor questions. Increased word count (odds ratio = 0.89 of answering a question correctly per +10 words) and higher-order problem-solving (odds ratio = 0.40, P = .009) were associated with lower accuracy for ChatGPT, but not for GPT-4 (both P > .005). Multimodal input was not available at the time of this study; hence, on questions with image content, ChatGPT and GPT-4 answered 49.5% and 56.8% of questions correctly based on contextual clues alone.

CONCLUSION: LLMs achieved passing scores on a mock 500-question neurosurgical written board examination, with GPT-4 significantly outperforming ChatGPT.
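The counts behind the reported percentages can be recovered from the abstract itself: 73.4% of 500 questions is 367 correct, which leaves the 133 incorrect questions quoted in "50/133", and 367 + 50 = 417 gives GPT-4's 83.4%. The sketch below reproduces that arithmetic and approximates the 95% confidence intervals. It is illustrative only: the abstract does not say which interval method the authors used, so the Wilson score interval here is an assumption and its bounds may differ from the published ones in the last digit. The helper names (`wilson_interval`, `scaled_odds_ratio`) are hypothetical, and scaling the per-10-word odds ratio to other word counts assumes the usual log-linear form of logistic regression.

```python
import math

TOTAL_QUESTIONS = 500  # size of the mock examination, from the abstract


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    Assumption: the abstract does not name its CI method; Wilson is
    used here only as a reasonable stand-in.
    """
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def scaled_odds_ratio(or_per_10_words, extra_words):
    """Odds ratio implied for `extra_words` additional words.

    Assumption: the effect is log-linear in word count, as in a
    standard logistic regression with word count as a linear term.
    """
    return or_per_10_words ** (extra_words / 10)


# Counts implied by the reported percentages.
chatgpt_correct = round(0.734 * TOTAL_QUESTIONS)        # 367
chatgpt_incorrect = TOTAL_QUESTIONS - chatgpt_correct   # 133
gpt4_correct = chatgpt_correct + 50                     # 417

if __name__ == "__main__":
    for name, correct in [("ChatGPT", chatgpt_correct), ("GPT-4", gpt4_correct)]:
        low, high = wilson_interval(correct, TOTAL_QUESTIONS)
        print(f"{name}: {correct}/{TOTAL_QUESTIONS} = "
              f"{correct / TOTAL_QUESTIONS:.1%} (approx. 95% CI {low:.1%}-{high:.1%})")
    print(f"Odds ratio for +50 words (ChatGPT): {scaled_odds_ratio(0.89, 50):.2f}")
```

Under these assumptions, the reported odds ratio of 0.89 per +10 words compounds multiplicatively, so a question 50 words longer would carry roughly 0.89 raised to the fifth power times the odds of a correct ChatGPT answer.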
Keywords
Neurosurgery, Medical education, Surgical education, Residency education, Artificial intelligence, Large language models, ChatGPT, GPT-4