VariErr NLI: Separating Annotation Error from Human Label Variation
CoRR (2024)
Abstract
Human label variation arises when annotators assign different labels to the
same item for valid reasons, while annotation errors occur when labels are
assigned for invalid reasons. These two issues are prevalent in NLP benchmarks,
yet existing research has studied them in isolation. To our knowledge, no
prior work focuses on teasing apart error from signal, especially where the
signal goes beyond black and white. To fill
this gap, we introduce a systematic methodology and a new dataset, VariErr
(variation versus error), focusing on the NLI task in English. We propose a
2-round annotation scheme in which annotators explain each label and
subsequently judge the validity of each label-explanation pair. VariErr
contains 7,574 validity judgments on 1,933 explanations for 500 re-annotated
NLI items.
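As a minimal sketch of what one VariErr-style record might look like under this 2-round scheme (the field names below are illustrative assumptions, not the released schema):

```python
from dataclasses import dataclass, field

@dataclass
class LabelExplanation:
    """Round 1: one annotator's NLI label plus a free-text explanation."""
    annotator: str
    label: str        # "entailment", "neutral", or "contradiction"
    explanation: str
    # Round 2: validity judgments from annotators on this
    # label-explanation pair (True = the stated reason is valid).
    validity_judgments: dict[str, bool] = field(default_factory=dict)

@dataclass
class VariErrItem:
    """One re-annotated NLI item with all its label-explanation pairs."""
    premise: str
    hypothesis: str
    annotations: list[LabelExplanation] = field(default_factory=list)
```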
We assess the effectiveness of various automatic error detection (AED) methods
and GPTs in uncovering errors versus human label variation. We find that
state-of-the-art AED methods significantly underperform compared to GPTs and
humans. While GPT-4 is the best system, it still falls short of human
performance. Our methodology is applicable beyond NLI, offering fertile ground
for future research on error versus plausible variation, which in turn can
yield better and more trustworthy NLP systems.
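To make the evaluation setup concrete, here is a hedged sketch of one plausible way to score an AED method against such validity judgments: treat a label as an error when none of its explanations was judged valid, and measure how well the method's error scores rank those labels. The scoring rule and the precision@k metric here are assumptions for illustration, not the paper's exact protocol.

```python
def precision_at_k(error_scores: list[float], is_error: list[bool], k: int) -> float:
    """Of the k items an AED method ranks as most error-like, what
    fraction are genuine errors according to the validity judgments?
    (Assumption: an item counts as an error when none of its
    explanations was judged valid.)"""
    ranked = sorted(zip(error_scores, is_error), key=lambda pair: -pair[0])
    return sum(err for _, err in ranked[:k]) / k

# Hypothetical usage: higher score = more likely an annotation error.
scores = [0.91, 0.10, 0.75, 0.33]
gold = [True, False, False, True]
print(precision_at_k(scores, gold, k=2))  # -> 0.5
```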