Benchmarks as Microscopes: A Call for Model Metrology
arXiv (2024)
Abstract
Modern language models (LMs) pose a new challenge in capability assessment.
Static benchmarks inevitably saturate without providing confidence in the
deployment tolerances of LM-based systems, but developers nonetheless claim
that their models have generalized traits such as reasoning or open-domain
language understanding based on these flawed metrics. The science and practice
of LMs require a new approach to benchmarking that measures specific
capabilities with dynamic assessments. To be confident in our metrics, we need
a new discipline of model metrology – one which focuses on how to generate
benchmarks that predict performance under deployment. Motivated by our
evaluation criteria, we outline how building a community of model metrology
practitioners – one focused on building tools and studying how to measure
system capabilities – is the best way to meet these needs and add clarity
to the AI discussion.