RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical Domain
CoRR (2024)
Abstract
Large Language Models (LLMs) increasingly support applications in a wide
range of domains, some with potentially high societal impact such as biomedicine,
yet their reliability in realistic use cases is under-researched. In this work
we introduce the Reliability AssesMent for Biomedical LLM Assistants (RAmBLA)
framework and evaluate whether four state-of-the-art foundation LLMs can serve
as reliable assistants in the biomedical domain. We identify prompt robustness,
high recall, and a lack of hallucinations as necessary criteria for this use
case. We design shortform tasks and tasks requiring LLM freeform responses
mimicking real-world user interactions. We evaluate LLM performance using
semantic similarity to a ground-truth response, as judged by an evaluator LLM.
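
The scoring idea in the last sentence can be illustrated with a minimal sketch. This is not the RAmBLA implementation: the abstract does not specify the evaluator's prompt, model, or decision rule, so the function names, the 0.5 threshold, and the token-overlap similarity used here in place of an evaluator LLM are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def token_jaccard(response: str, reference: str) -> float:
    """Stand-in similarity (assumption): Jaccard overlap of lowercased word sets.

    The paper uses an evaluator LLM to judge semantic similarity; this simple
    lexical measure only keeps the sketch self-contained and runnable.
    """
    a, b = set(response.lower().split()), set(reference.lower().split())
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return len(a & b) / len(a | b)


def fraction_matching(
    responses: Sequence[str],
    references: Sequence[str],
    similarity: Callable[[str, str], float] = token_jaccard,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> float:
    """Return the fraction of responses judged similar to their ground truth."""
    if len(responses) != len(references):
        raise ValueError("responses and references must have the same length")
    if not responses:
        raise ValueError("at least one response is required")
    hits = sum(
        similarity(resp, ref) >= threshold
        for resp, ref in zip(responses, references)
    )
    return hits / len(responses)
```

Passing a different `similarity` callable (for example, one that wraps a call to an evaluator model and returns a score between 0 and 1) would bring the sketch closer to the setup the abstract describes, without changing the aggregation logic.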