Evaluating ChatGPT-4's accuracy in identifying final diagnoses within differential diagnoses compared to those of physicians: an experimental study for diagnostic cases (Preprint)

JMIR Formative Research (2024)

Abstract
BACKGROUND: The potential of artificial intelligence (AI) chatbots, particularly the fourth-generation Chat Generative Pretrained Transformer (ChatGPT-4), to assist with medical diagnosis is an emerging research area. However, it is not yet clear how well AI chatbots can evaluate whether a final diagnosis is included in a differential-diagnosis list.

OBJECTIVE: This study aimed to assess ChatGPT-4's ability to identify the final diagnosis from differential-diagnosis lists and to compare its performance with that of physicians, using a series of case reports.

METHODS: We used a database of differential-diagnosis lists, generated from case reports in the American Journal of Case Reports and paired with their final diagnoses. The lists were generated by three AI systems: ChatGPT-4, Google Bard (now Google Gemini), and the Large Language Model Meta AI 2 (LLaMA 2) chatbot; none of these AIs received additional medical training or reinforcement. The primary outcome was whether ChatGPT-4's evaluations identified the final diagnosis within these lists. For comparison, two independent physicians also evaluated the lists, with any disagreements resolved by a third physician.

RESULTS: The three AIs generated a total of 1,176 differential-diagnosis lists from 392 case descriptions. ChatGPT-4's evaluations concurred with those of the physicians in 966 of the 1,176 lists (82.1%). The Cohen kappa coefficient was 0.63 (95% confidence interval 0.56-0.69), indicating fair-to-good agreement between ChatGPT-4's and the physicians' evaluations.

CONCLUSIONS: ChatGPT-4 showed fair-to-good agreement with physicians in identifying the final diagnosis from differential-diagnosis lists, comparable to physicians, for a case report series. Its ability to compare differential-diagnosis lists against final diagnoses suggests potential for supporting clinical decision-making through diagnostic feedback. However, its application in real-world scenarios and further validation in diverse clinical environments are essential to fully understand its utility in the diagnostic process.

CLINICALTRIAL: Not applicable.
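As a reading aid for the reported statistics: Cohen's kappa is defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The abstract reports p_o = 966/1176 (82.1%) and kappa = 0.63, so the chance-agreement level implied by those two figures can be recovered by rearranging the formula. The sketch below does only that arithmetic; the actual contingency table is not given in the abstract, so p_e here is derived, not reported.

```python
# Relate the abstract's reported figures via the Cohen kappa formula:
#   kappa = (p_o - p_e) / (1 - p_e)
# Rearranged for the (unreported) chance agreement:
#   p_e = (p_o - kappa) / (1 - kappa)

p_o = 966 / 1176   # observed agreement between ChatGPT-4 and physicians (82.1%)
kappa = 0.63       # reported Cohen kappa coefficient

p_e = (p_o - kappa) / (1 - kappa)  # implied chance agreement

print(f"observed agreement p_o = {p_o:.3f}")        # ~0.821
print(f"implied chance agreement p_e = {p_e:.3f}")  # ~0.517
```

In other words, the reported kappa of 0.63 is consistent with roughly half of the raw 82.1% agreement being attainable by chance alone, which is why kappa, rather than raw agreement, is the headline metric.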