Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
arXiv (2024)
Abstract
We introduce Vibe-Eval: a new open benchmark and framework for evaluating
multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts,
including 100 of hard difficulty, complete with gold-standard responses
authored by experts. Vibe-Eval is open-ended and challenging with dual
objectives: (i) vibe checking multimodal chat models for day-to-day tasks and
(ii) rigorously testing and probing the capabilities of present frontier
models. Notably, our hard set contains >50% of prompts that all frontier models
answer incorrectly. We explore the nuances of designing, evaluating, and
ranking models on ultra-challenging prompts. We also discuss trade-offs between
human and automatic evaluation, and show that automatic model evaluation using
Reka Core roughly correlates to human judgment. We offer free API access for
the purpose of lightweight evaluation and plan to conduct formal human
evaluations for public models that perform well on Vibe-Eval's automatic
scores. We release the evaluation code and data, see
https://github.com/reka-ai/reka-vibe-eval
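
To make the automatic-evaluation idea concrete, the sketch below shows the general shape of a model-as-judge scorer: a strong judge model (here standing in for Reka Core) rates a candidate answer against the expert-authored gold response. This is a minimal illustration under assumptions, not the official scoring code from the repository linked above; the `judge_model.generate` interface, the 1-5 rating scale, and the `Example` fields are all hypothetical stand-ins.

```python
# Hypothetical sketch of model-as-judge scoring for a Vibe-Eval-style
# suite. NOT the official implementation; see the repo linked above.
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str           # visual-understanding question (image omitted here)
    gold_response: str    # expert-authored reference answer
    difficulty: str       # "normal" or "hard"

JUDGE_TEMPLATE = """\
You are grading a model's answer against an expert reference.
Question: {prompt}
Reference answer: {gold}
Model answer: {candidate}
Rate the model answer from 1 (wrong) to 5 (matches the reference)."""

def judge_score(judge_model, example: Example, candidate: str) -> int:
    """Ask a strong judge model (e.g. Reka Core) to rate one response."""
    reply = judge_model.generate(          # assumed text-generation interface
        JUDGE_TEMPLATE.format(
            prompt=example.prompt,
            gold=example.gold_response,
            candidate=candidate,
        )
    )
    return int(reply.strip()[0])           # assumes the judge leads with a digit

def benchmark(judge_model, examples, candidate_answers):
    """Mean judge score over the suite, with the hard subset reported separately."""
    scores = [judge_score(judge_model, ex, ans)
              for ex, ans in zip(examples, candidate_answers)]
    hard = [s for ex, s in zip(examples, scores) if ex.difficulty == "hard"]
    return sum(scores) / len(scores), sum(hard) / max(len(hard), 1)
```

Reporting the hard subset separately reflects the paper's framing: aggregate scores alone can mask the gap between day-to-day competence and performance on prompts that frontier models still fail.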