Publicly Available Generative Artificial Intelligence Programs Are Currently Unsuitable for Performing Meta-Analyses

American Journal of Gastroenterology (2023)

Abstract
Introduction: Generative Artificial Intelligence (GenAI) has captured the public's attention since late 2022, when ChatGPT was introduced, with new applications for clinical medicine, research, and more appearing daily. After completing a recent meta-analysis on the accuracy of visual estimation for measuring the size of colorectal polyps, we sought to explore whether commercially available GenAI programs are capable of replicating our work.

Methods: Thirty tasks needed to complete a meta-analysis were created and split evenly across Bloom's 2021 taxonomy of questions (5 questions per type). Expected answer types and answer content were pre-defined based on our previous meta-analysis query of MEDLINE and Embase. Each question was entered into two publicly available GenAI systems, ChatGPT-3 (OpenAI, Inc.) and Bard (Google, LLC), 5 times in separate contexts on May 30-31, 2023. Answers were graded on whether they met the expected answer type and the expected answer content. The frequency of concordance was calculated using SPSS, and the two GenAI models were compared with the chi-square test (P < 0.05 for significance).

Results: Bard was more successful at giving the expected answer types and was comparable or statistically superior to ChatGPT in answer content across all question and answer types (Table 1). Both GenAI systems showed relatively higher accuracy when displaying data ("Remember") and applying it to a new situation ("Apply"). The lowest accuracy was in analyzing data and creating new information. On qualitative analysis, ChatGPT tended to substitute explanations for other answer types. Both models would substitute raw knowledge or a vague answer when they were unable to perform an expected task, often producing hallucinations. Bard was more consistent in its answers, whether right or wrong, than ChatGPT, which would give different answers to the same question. ChatGPT was able to calculate mathematical answers from given data. Bard was able to explain how one would solve a mathematical problem, but in each instance made data errors in its calculations.

Conclusion: GenAI, specifically Large Language Models (LLMs), is a powerful new tool that is quickly expanding to new applications. While these models can display the answer types one would expect, their content is variable at best, and often incorrect or a hallucination. Combined with the inability to synthesize data, these models are currently unsuitable for assisting in meta-analyses, but the landscape is quickly changing.

Table 1. Accuracy of ChatGPT-3 and Bard in completing tasks needed to perform a meta-analysis

Production of expected answer type
                   ChatGPT-3   Bard    P-value
  Overall          82%         100%    <0.001*
  Explanations     100%        100%    1
  Lists            65%         100%    0.004*
  Binary           53%         100%    0.003*
  Data Synthesis   64%         100%    0.001*

Answer content accuracy by question type
                   ChatGPT-3   Bard    P-value
  Overall          27%         60%     <0.001*
  Remember         40%         76%     0.01*
  Understand       20%         96%     <0.001*
  Apply            48%         48%     1
  Analyze          12%         44%     0.01*
  Evaluate         40%         72%     0.02*
  Create           0%          20%     0.02*

Answer content accuracy by answer type
                   ChatGPT-3   Bard    P-value
  Explanations     26%         64%     <0.001*
  Lists            40%         60%     0.21
  Binary           27%         80%     0.004*
  Data Synthesis   20%         12%     0.45

*P < 0.05 for chi-square of ChatGPT-3 vs Bard.
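The Methods compare the two models' proportions of correct responses with a chi-square test at P < 0.05, computed in SPSS. As a minimal sketch of that kind of comparison, the Python snippet below re-runs the overall "expected answer type" contrast; the counts (123/150 vs 150/150) are assumptions derived from the reported 82% and 100% over 30 tasks entered 5 times per model, not data taken from the study.

```python
# Minimal sketch of the chi-square comparison described in the Methods.
# The study used SPSS; scipy is used here purely for illustration.
# Assumed counts: 30 tasks x 5 repetitions = 150 responses per model,
# so 82% -> ~123 correct for ChatGPT-3 and 100% -> 150 correct for Bard.
from scipy.stats import chi2_contingency

n_per_model = 150
chatgpt_correct = 123  # assumed from the reported 82%
bard_correct = 150     # assumed from the reported 100%

# 2x2 contingency table: rows = model, columns = correct / incorrect answer type
observed = [
    [chatgpt_correct, n_per_model - chatgpt_correct],
    [bard_correct,    n_per_model - bard_correct],
]

chi2, p, dof, _ = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2e}")
# A p-value well below 0.05 would mirror the reported significance (< 0.001).
```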
Keywords
artificial intelligence, meta-analyses