Navigating the Future: Evaluating Artificial Intelligence Recommendations for Colonoscopy Screening Against Standardized Guidelines

Mohammed S. Al-Zakwani, Shuhaib Ali, Pir Shah, Juan Echavarria, Omer Shahab, Muhammad Haris, Bara El Kurdi

The American Journal of Gastroenterology (2023)

Abstract
Introduction: The application of artificial intelligence (AI) in healthcare has progressed rapidly, with large language models (LLMs) emerging as essential tools because of their proficiency in assimilating unstructured clinical data and generating meaningful inferences. LLMs have demonstrated potential in executing routine clinical tasks. We aimed to assess the ability of widely accessible LLMs to perform a specific clinical function: providing recommendations for polyp surveillance colonoscopy consistent with society guidelines.
Methods: A cohort of 100 patients who underwent the most recent screening/surveillance colonoscopies at our institution was evaluated. Four LLMs, GPT 3.5 and GPT 4 (ChatGPT) and PaLM-1 and PaLM-2 (Google Bard), were used to determine the recommended interval for the subsequent surveillance exam according to the 2020 multi-society task force guidelines for colorectal cancer (CRC). Patient age, sex, family history of CRC, colonoscopy findings, and pathology reports were entered into each LLM. Model-generated recommendations were then compared with those provided by a human expert. Inaccurate responses prompted further input until the correct answer was reached. The best-performing model was additionally tested using an engineered prompt that required quoting the guidelines for each recommendation.
Results: GPT 3.5 achieved 26% accuracy on the initial prompt and required 135 additional prompts in total to reach 100% accuracy. GPT 4 achieved 72% accuracy and required 44 additional prompts to reach 100%. PaLM-1 achieved 27% accuracy and needed 109 additional prompts to reach 100%, while PaLM-2 achieved 36% accuracy and needed 105 additional prompts to reach 100%. With the engineered prompt, GPT 4 achieved 89% accuracy (Table 1).
Conclusion: Our findings indicate that LLMs, particularly GPT 4, have substantial potential for performing specific clinical tasks, with a 72% success rate in a zero-shot approach (single direct prompt) for determining the recommended timeframe for surveillance colonoscopy. The efficacy of these models can be improved through prompt engineering, which raised accuracy to 89%. Furthermore, the more advanced versions of both GPT and PaLM outperformed their predecessors. As these models continue to evolve, we anticipate that combining future iterations, engineered prompts, and fine-tuning based on clinical guidelines will considerably elevate overall performance to clinically acceptable standards.

Table 1. Study Results

Artificial Intelligence Program | Percentage of Correct Responses After Only the Initial Prompt Was Provided | Average Number of Prompts Needed to Achieve Correct Response
ChatGPT 3.5 | 26% | 1.35
ChatGPT 4.0 | 72% | 0.44
Bard PALM-1 | 27% | 1.09
Bard PALM-2 | 36% | 1.19
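The relationship between the totals quoted in the Results and the averages in Table 1 can be checked with a little arithmetic. The sketch below is only illustrative: it assumes that the "average number of prompts" column is the total number of additional prompts divided by the 100 patients in the cohort, which the abstract does not state explicitly. The names `RESULTS`, `average_extra_prompts`, and `summarize` are hypothetical, and all figures are copied from the abstract. Under this assumption, the GPT 3.5, GPT 4, and PaLM-1 rows agree with the text, while the PaLM-2 row (105 total prompts in the text, 1.19 in the table) does not.

```python
# Figures copied from the abstract; the cohort size of 100 comes from Methods.
N_PATIENTS = 100

# model: (initial accuracy in %, total additional prompts from the text,
#         average prompts reported in Table 1)
RESULTS = {
    "GPT 3.5": (26, 135, 1.35),
    "GPT 4": (72, 44, 0.44),
    "PaLM-1": (27, 109, 1.09),
    "PaLM-2": (36, 105, 1.19),
}


def average_extra_prompts(total_prompts, n_patients=N_PATIENTS):
    """Assumed definition: total additional prompts divided by cohort size."""
    return total_prompts / n_patients


def summarize(results, n_patients=N_PATIENTS):
    """Derive per-model counts and compare the derived average with Table 1."""
    rows = []
    for model, (accuracy_pct, total_prompts, table_avg) in results.items():
        correct = accuracy_pct * n_patients // 100
        derived_avg = average_extra_prompts(total_prompts, n_patients)
        rows.append({
            "model": model,
            "initially_correct": correct,
            "initially_incorrect": n_patients - correct,
            "derived_avg": derived_avg,
            "table_avg": table_avg,
            # True when the text-derived average equals the Table 1 value.
            "matches_table": abs(derived_avg - table_avg) < 1e-9,
        })
    return rows


if __name__ == "__main__":
    for row in summarize(RESULTS):
        print(row)
```

Dividing by the full cohort, rather than by the number of initially incorrect cases, is the reading that reproduces three of the four Table 1 values, which is why it is used here.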
Keywords
colonoscopy screening, artificial intelligence recommendations, artificial intelligence