The Comparative Diagnostic Capability of Large Language Models in Otolaryngology

Akshay Warrier, Rohan Singh, Afash Haleem, Haider Zaki, Jean Anderson Eloy

The Laryngoscope (2024)

Abstract
Objectives: To evaluate and compare the ability of large language models (LLMs) to diagnose various ailments in otolaryngology.

Methods: We collected all 100 clinical vignettes from the second edition of Otolaryngology Cases: The University of Cincinnati Clinical Portfolio by Pensak et al. Prefacing each vignette with the prompt "Provide a diagnosis given the following history," we queried ChatGPT-3.5, Google Bard, and Bing-GPT4 for a diagnosis. Each response was compared against the portfolio's diagnosis and scored for accuracy. All queries were run in June 2023.

Results: ChatGPT-3.5 was the most accurate model (89% success rate), followed by Google Bard (82%) and Bing-GPT4 (74%). A chi-squared test revealed a significant difference among the three LLMs in providing correct diagnoses (p = 0.023). Of the 100 vignettes, seven required additional test results (e.g., biopsy, non-contrast CT) for an accurate clinical diagnosis. With these vignettes omitted, the revised success rates were 95.7% for ChatGPT-3.5, 88.17% for Google Bard, and 78.72% for Bing-GPT4 (p = 0.002).

Conclusions: ChatGPT-3.5 offers the most accurate diagnoses of the three models when given established clinical vignettes. LLMs may accurately assess common otolaryngology conditions but currently require detailed prompt information and critical supervision from clinicians. There is vast potential in the clinical applicability of LLMs; however, practitioners should be wary of possible "hallucinations" and misinformation in responses.
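The abstract does not say how the queries were issued, and Google Bard and Bing-GPT4 had no comparable public API in June 2023, so those models were presumably queried through their web interfaces. As a rough illustration of the Methods only, the ChatGPT-3.5 arm could be scripted along these lines with the OpenAI Python SDK; the `vignettes` list and the `diagnose` helper are hypothetical stand-ins, and only the quoted prompt text comes from the paper.

```python
# Hypothetical sketch of the ChatGPT-3.5 arm of the study; the vignette
# text is a placeholder (the portfolio's cases are not reproduced here).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Provide a diagnosis given the following history"  # verbatim from Methods

def diagnose(vignette: str) -> str:
    """Return the model's free-text diagnosis for one clinical vignette."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"{PROMPT}: {vignette}"}],
    )
    return response.choices[0].message.content

vignettes = ["..."]  # placeholder for the 100 case histories
diagnoses = [diagnose(v) for v in vignettes]
```

The chi-squared comparison in the Results can be reproduced from the published success counts alone. The sketch below assumes the test was run on a 3x2 contingency table of correct versus incorrect diagnoses per model (the abstract does not specify the layout); with 89, 82, and 74 correct answers out of 100 each, SciPy returns p close to 0.023, matching the reported value.

```python
# Chi-squared test on the reported counts; the table layout (models x
# correct/incorrect) is an assumption, the counts are from the Results.
from scipy.stats import chi2_contingency

counts = [
    [89, 11],  # ChatGPT-3.5: correct, incorrect (out of 100)
    [82, 18],  # Google Bard
    [74, 26],  # Bing-GPT4
]

chi2, p, dof, _ = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")  # chi2 = 7.53, dof = 2, p = 0.023
```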
Keywords
artificial intelligence, Bing AI, ChatGPT, diagnostic accuracy, Google Bard, large language models, otolaryngology