Benchmarks as Microscopes: A Call for Model Metrology

Michael Saxon, Ari Holtzman, Peter West, William Yang Wang, Naomi Saphra

arXiv (2024)

Abstract
Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, yet developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs require a new approach to benchmarking, one that measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology – one which focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners – one focused on building tools and studying how to measure system capabilities – is the best way to meet these needs and add clarity to the AI discussion.