Evaluation of AI-generated responses by different artificial intelligence chatbots to the clinical decision-making case-based questions in oral and maxillofacial surgery.
Oral Surgery, Oral Medicine, Oral Pathology and Oral Radiology (2024)
Abstract
OBJECTIVES: This study aims to evaluate the correctness of answers generated by the Google Bard, GPT-3.5, GPT-4, Claude-Instant, and Bing chatbots to clinical decision-making questions in the oral and maxillofacial surgery (OMFS) field.
STUDY DESIGN: A group of 3 board-certified oral and maxillofacial surgeons designed a questionnaire with 50 case-based questions in multiple-choice and open-ended formats. Three referees checked the chatbots' answers to the multiple-choice questions against the correct options. The chatbots' answers to the open-ended questions were graded on a modified global quality scale. A P-value below .05 was considered statistically significant.
RESULTS: Bard, GPT-3.5, GPT-4, Claude-Instant, and Bing answered 34%, 36%, 38%, 38%, and 26% of the questions correctly, respectively. On the open-ended questions, GPT-4 had the most answers graded "4" or "5," and Bing had the most answers graded "1" or "2." There were no statistically significant differences among the 5 chatbots on either the open-ended (P = .275) or the multiple-choice (P = .699) questions.
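The multiple-choice comparison above can be reconstructed as a chi-square test of homogeneity on the correct/incorrect counts per chatbot. The sketch below assumes each reported percentage is taken over all 50 questions (34% of 50 = 17 correct, and so on); the paper does not state the exact denominator, so the counts and the test choice are assumptions, though they reproduce the reported P = .699.

```python
import math

# Assumed: each chatbot's accuracy is a fraction of all 50 questions.
N = 50
rates = {"Bard": 0.34, "GPT-3.5": 0.36, "GPT-4": 0.38,
         "Claude-Instant": 0.38, "Bing": 0.26}
correct = [round(r * N) for r in rates.values()]    # [17, 18, 19, 19, 13]
incorrect = [N - c for c in correct]                # [33, 32, 31, 31, 37]

# Chi-square statistic for the 2 x 5 table (rows: correct/incorrect,
# columns: chatbots), using equal expected counts under homogeneity.
exp_c = sum(correct) / len(correct)                 # expected correct per bot
exp_w = N - exp_c                                   # expected incorrect per bot
chi2 = sum((c - exp_c) ** 2 / exp_c + (w - exp_w) ** 2 / exp_w
           for c, w in zip(correct, incorrect))

# df = (2-1) * (5-1) = 4; the chi-square survival function for df = 4
# has the closed form (1 + x/2) * exp(-x/2).
p_value = (1 + chi2 / 2) * math.exp(-chi2 / 2)
print(f"chi2 = {chi2:.3f}, P = {p_value:.3f}")      # P ≈ .699
```

Under these assumed counts the statistic is small relative to 4 degrees of freedom, which is consistent with the paper's finding of no significant difference between the chatbots.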
CONCLUSION: Given the major inaccuracies in the chatbots' responses, despite their relatively good performance on the open-ended questions, this technology cannot yet be trusted as a consultant for clinicians in decision-making situations.