Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity
CoRR (2023)
Abstract
This paper introduces benchmarks for evaluating Vision-Language Models
(VLMs) on real-world zero-shot recognition tasks, focusing on the
granularity and specificity of the prompting text. We propose an evaluation
protocol that uses adapted ImageNet and MS-COCO datasets to assess how
consistently models recognize concepts at different levels of granularity and
how sensitive they are to the specificity of language inputs. Our evaluation
shows that state-of-the-art VLMs, including contrastive models such as CLIP,
struggle with granularity and are sensitive to text specificity, which limits
their effectiveness in open-world settings. This study, the first to evaluate
VLMs from these perspectives, provides insights and tools for the community,
highlights these limitations, and paves the way for models that generalize
better in zero-shot recognition.
Key words
recognition, granularity, zero-shot, vision-language