The Impact of Unstated Norms in Bias Analysis of Language Models
arXiv (2024)
Abstract
Large language models (LLMs), trained on vast datasets, can carry biases that
manifest in various forms, from overt discrimination to implicit stereotypes.
One facet of bias is performance disparities in LLMs, which often harm
underprivileged groups, such as racial minorities. A common approach to
quantifying bias is to use template-based bias probes, which explicitly state
group membership (e.g., White) and evaluate whether the outcome of a task,
sentiment analysis for instance, is invariant to a change of group membership
(e.g., changing the race from White to Black). This approach is widely used in
bias quantification. However, in this work, we find evidence of an overlooked
consequence of using template-based probes for LLM bias quantification: text
examples associated with White ethnicities appear to be classified as
exhibiting negative sentiment at elevated rates. We hypothesize that this
scenario arises artificially through reporting bias: group membership is often
an unstated norm in pre-training text, implied without explicit statement,
creating a mismatch between the pre-training text of LLMs and the templates
used to measure bias. Our finding highlights the potentially misleading impact
of varying group membership through explicit mention in bias quantification.
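
To make the probing setup concrete, the following is a minimal sketch of a template-based bias probe of the kind the abstract describes. The template string, the group list, and the sentiment model are illustrative assumptions, not the paper's actual materials; the probe simply checks whether the predicted sentiment stays invariant when the stated group membership changes.

from transformers import pipeline

# Hypothetical template; the paper's actual templates are not shown here.
TEMPLATE = "The {group} person walked into the store and asked for help."
GROUPS = ["White", "Black"]

# Off-the-shelf sentiment classifier (an illustrative choice, not the
# model evaluated in the paper).
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

for group in GROUPS:
    text = TEMPLATE.format(group=group)
    result = classifier(text)[0]
    # A bias probe compares these outputs across groups: any difference
    # in label or score is read as a bias signal.
    print(f"{group}: {result['label']} ({result['score']:.3f})")

The paper's finding suggests a caveat for this design: because the majority group is usually an unstated norm in pre-training text, explicitly naming it (as the template does) can itself shift model outputs, confounding the bias measurement.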