Performance of ChatGPT in medical examinations: A systematic review and a meta-analysis

BJOG: An International Journal of Obstetrics and Gynaecology (2024)

Abstract
The use of ChatGPT, an artificial intelligence (AI) language model, has been described in various scientific and medical applications.1 With its human-like conversational capacity and large quantity of training data, ChatGPT has the potential to become an important medical education tool.2 ChatGPT's performance in medical knowledge examinations has recently been studied in various medical disciplines; however, reported rates of correct answers vary dramatically across examinations and medical fields.3, 4 We aimed to conduct a meta-analysis of studies reporting ChatGPT's performance in medical examinations with multiple-choice questions.

PubMed, Scopus and Web of Science were searched for relevant articles from the inception of these databases to 2 June 2023, using the search term “ChatGPT” (no equivalent MeSH term exists). We manually reviewed every article title and abstract; if an abstract was not available, we retrieved it from the journal's website. We included all peer-reviewed articles assessing the performance (number of correct answers/number of questions) of ChatGPT on multiple-choice questions in the field of medicine. Exclusion criteria were: (i) evaluation of ChatGPT in a language other than English; (ii) evaluation of ChatGPT in a setting other than multiple-choice questions (e.g. open questions, frequently asked questions); and (iii) studies not reporting on ChatGPT version 3.5. All review stages were conducted independently by two reviewers (RM and GL); disagreements were resolved by discussion with a third reviewer (YB). Data were extracted from each included study without modification, and a database was constructed containing the study's field of medicine (e.g. dermatology, plastic surgery), cohort size (number of questions answered), number of correct answers, ChatGPT performance (number of correct answers/number of answered questions) and the 95% CI for the performance rate. We used MedCalc Statistical Software version 19.2.6 (MedCalc Software bv, Ostend, Belgium) and OpenMeta[Analyst] for the analysis. The process of literature search and article selection is presented in the Supplementary material.

A total of 19 articles were included in the analysis. Two articles (11%) studied plastic surgery examinations, two (11%) studied the United States Medical Licensing Examination (USMLE) and two (11%) studied anaesthesia examinations; all other publications studied different medical fields (Table 1). The median number of questions per examination was 242, ranging from 20 in a medical physiology examination to 3705 in an anaesthesia examination (mean 524.5, SD 847.2; Figure 1, Table 1). Overall performance of ChatGPT ranged from 40% in a biomedical admission test to 100% in a diabetes knowledge questionnaire. The mean performance of ChatGPT was 61.1% (95% CI 56.1%–66.0%).

As the literature on the performance of ChatGPT in medical education is mounting, summarising its current performance in medical examinations can provide insight into the present and future of AI in medical education. Our meta-analysis suggests that ChatGPT correctly answered the majority of multiple-choice questions in medical examinations and demonstrated performance approximating a passing grade.
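The letter does not state which pooling model MedCalc or OpenMeta[Analyst] applied. As a minimal sketch of the kind of calculation behind a pooled rate such as 61.1% (95% CI 56.1%–66.0%), the code below assumes a DerSimonian–Laird random-effects meta-analysis of logit-transformed proportions, a common default for pooling correct-answer rates; the per-study counts are hypothetical placeholders, not the extracted data from this review.

```python
import math

# Hypothetical per-study data: (correct answers, total questions).
# Illustrative numbers only -- not the 19 studies in this meta-analysis.
studies = [(146, 242), (40, 100), (2223, 3705), (18, 20)]

# Logit-transform each proportion; the variance of a logit-transformed
# proportion is approximately 1/x + 1/(n - x).
effects, variances = [], []
for x, n in studies:
    p = x / n
    effects.append(math.log(p / (1 - p)))
    variances.append(1 / x + 1 / (n - x))

# Fixed-effect (inverse-variance) pooled estimate, needed for Cochran's Q.
w = [1 / v for v in variances]
theta_fe = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

# DerSimonian-Laird estimate of the between-study variance tau^2.
q = sum(wi * (yi - theta_fe) ** 2 for wi, yi in zip(w, effects))
c = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(studies) - 1)) / c)

# Random-effects weights, pooled logit and its 95% CI.
w_re = [1 / (v + tau2) for v in variances]
theta = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
se = math.sqrt(1 / sum(w_re))
lo, hi = theta - 1.96 * se, theta + 1.96 * se

# Back-transform from the logit scale to a summary proportion.
def inv_logit(t):
    return 1 / (1 + math.exp(-t))

print(f"pooled: {inv_logit(theta):.3f} "
      f"(95% CI {inv_logit(lo):.3f}-{inv_logit(hi):.3f})")
```

Back-transforming the pooled logit and its confidence limits yields the summary estimate on the original percentage scale, which is how a pooled performance rate with a 95% CI is conventionally reported.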
Medical education, test preparation services and medical examinations form a large industrial market.5 Currently, the use of ChatGPT for examination preparation should be prudent, and preparation for examinations using this platform should be done with proper caution. A correct response rate of more than 95% may allow ChatGPT to become a reliable educational tool,6 and it is unknown whether future versions will reach this maturity level, as the training dataset is not developed with a specific focus on medical education. Our limitations are the inclusion of only ChatGPT version 3.5 studies, the heterogeneity of the included studies (with the number of answer choices per examination unreported) and the possible overestimation of the tool's performance. Future meta-analyses may include future versions of AI chatbots to provide an updated understanding of their role in medical education.

Acknowledgements: None.
Funding: This research received no external funding.
Conflicts of interest: None declared.
Ethics approval: No ethical approval was needed from the institutional review board, as this study analysed only publicly available published data and no human patient data were used.
Data availability: The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
Supporting information: Data S1. Please note: the publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.
Keywords
artificial intelligence, ChatGPT, education, examination, meta-analysis