Localized Symbolic Knowledge Distillation for Visual Commonsense Models
NeurIPS 2023
Abstract
Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only for supporting reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build a Localized Visual Commonsense model that allows users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and local literal region descriptions automatically generated by a set of VL models. This pipeline is scalable and fully automatic, as no aligned or human-authored image-text pairs are required. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus, expanded solely from images, can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in zero-shot settings demonstrate that our distillation method yields more precise VL models of reasoning than a baseline that passes a generated referring expression.
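
To make the distillation pipeline concrete, below is a minimal Python sketch of the sample-then-filter loop the abstract describes. Every name here is a hypothetical stand-in introduced for illustration: `global_caption`, `region_descriptions`, `sample_llm`, and `critic_score` are placeholders for the authors' VL captioners, LLM, and critic model, not their actual implementations.

```python
# Minimal sketch of the localized knowledge sampling pipeline from the
# abstract. All model calls are hypothetical stubs, not the real models.

from typing import List


def global_caption(image_path: str) -> str:
    """Stand-in for a VL model producing a global literal image description."""
    return "Two people sit at a cafe table; one gestures at a laptop."


def region_descriptions(image_path: str, regions: List[str]) -> List[str]:
    """Stand-in for a VL model describing each user-specified region."""
    return [f"a person leaning forward over the table" for _ in regions]


def build_prompt(caption: str, region_texts: List[str]) -> str:
    """Compose the global and local descriptions into a single LLM prompt
    asking for commonsense knowledge grounded in the indexed regions."""
    regions = "\n".join(f"[{i}] {t}" for i, t in enumerate(region_texts))
    return (
        f"Image description: {caption}\n"
        f"Regions:\n{regions}\n"
        "Write commonsense inferences that refer to the regions by [index]."
    )


def sample_llm(prompt: str, n: int = 3) -> List[str]:
    """Stand-in for sampling n completions from an LLM."""
    return [f"[0] is probably explaining something on screen (sample {k})"
            for k in range(n)]


def critic_score(example: str) -> float:
    """Stand-in for the separately trained critic rating example quality."""
    return 0.9  # a real critic would score plausibility given the image


def distill_examples(image_path: str, regions: List[str],
                     threshold: float = 0.8) -> List[str]:
    """Sample localized commonsense from the LLM, keep high-quality examples."""
    prompt = build_prompt(global_caption(image_path),
                          region_descriptions(image_path, regions))
    return [ex for ex in sample_llm(prompt) if critic_score(ex) >= threshold]


if __name__ == "__main__":
    for ex in distill_examples("cafe.jpg", ["region_0"]):
        print(ex)
```

In the pipeline the abstract describes, the examples that survive the critic filter would form the localized commonsense corpus used to distill a VL model that accepts regions as input.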