CaLMQA: Exploring culturally specific long-form question answering across 23 languages
arXiv (2024)
Abstract
Large language models (LLMs) are commonly used for long-form question
answering, which requires them to generate paragraph-length answers to complex
questions. While long-form QA has been well-studied in English via many
different datasets and evaluation metrics, this research has not been extended
to cover most other languages. To bridge this gap, we introduce CaLMQA, a
collection of 2.6K complex questions spanning 23 languages, including
under-resourced, rarely studied languages such as Fijian and Kirundi. Our
dataset includes both naturally occurring questions collected from community
web forums and questions written by native speakers hired for this purpose.
This process yields diverse, complex questions that reflect cultural topics
(e.g., traditions, laws, news) and the language usage of native
speakers. We conduct automatic evaluation across a suite of open- and
closed-source models using our novel metric CaLMScore, which detects incorrect
language and token repetitions in answers, and observe that the quality of
LLM-generated answers degrades significantly for some low-resource languages.
We perform human evaluation on a subset of models and find that model
performance is significantly worse for culturally specific questions than for
culturally agnostic questions. Our findings highlight the need for further
research in LLM multilingual capabilities and non-English LFQA evaluation.
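The abstract names what CaLMScore checks for (answers in the wrong language and token repetition) but not how it is computed. Below is a minimal Python sketch of such a check under our own assumptions: the use of the third-party `langdetect` package as the language identifier, the n-gram repetition measure, the 0.2 threshold, and the binary scoring are all illustrative stand-ins, not the authors' implementation.

```python
# A minimal sketch in the spirit of CaLMScore: flag answers that are in the
# wrong language or degenerate-repetitive. All specifics here are assumptions.
# `langdetect` is a third-party package (pip install langdetect).

from collections import Counter
from langdetect import detect, LangDetectException


def repetition_rate(text: str, n: int = 4) -> float:
    """Fraction of n-gram occurrences that repeat an earlier n-gram."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def calm_style_score(answer: str, expected_lang: str,
                     max_repetition: float = 0.2) -> float:
    """Hypothetical binary variant: 1.0 if the answer is in the expected
    language and not excessively repetitive, else 0.0."""
    try:
        detected = detect(answer)  # ISO 639-1 code, e.g. 'en'
    except LangDetectException:   # empty or undecipherable text
        return 0.0
    if detected != expected_lang:
        return 0.0
    if repetition_rate(answer) > max_repetition:
        return 0.0
    return 1.0


# Example: an English answer scored against an expected language of Fijian
# ('fj') fails the language check and receives 0.0 under this sketch.
print(calm_style_score("This answer is in English, not Fijian.", "fj"))
```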