Discriminative Probing and Tuning for Text-to-Image Generation
CVPR 2024
Abstract
Despite advancements in text-to-image generation (T2I), prior methods often
face text-image misalignment problems such as relation confusion in generated
images. Existing solutions involve cross-attention manipulation for better
compositional understanding or integrating large language models for improved
layout planning. However, the inherent alignment capabilities of T2I models are
still inadequate. By reviewing the link between generative and discriminative
modeling, we posit that T2I models' discriminative abilities may reflect their
text-image alignment proficiency during generation. In this light, we advocate
bolstering the discriminative abilities of T2I models to achieve more precise
text-to-image alignment for generation. We present a discriminative adapter
built on T2I models to probe their discriminative abilities on two
representative tasks and leverage discriminative fine-tuning to improve their
text-image alignment. As a bonus of the discriminative adapter, a
self-correction mechanism can leverage discriminative gradients to better align
generated images to text prompts during inference. Comprehensive evaluations
across three benchmark datasets, including both in-distribution and
out-of-distribution scenarios, demonstrate our method's superior generation
performance. Meanwhile, it achieves state-of-the-art performance on the two
discriminative tasks compared to other generative models.
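The self-correction mechanism described above can be pictured as a classifier-guidance-style update: at an inference step, the gradient of a discriminative alignment score with respect to the latents is used to nudge generation toward the text prompt. The following is a minimal sketch of that idea only, assuming a latent-diffusion T2I backbone; the names (`t2i_unet`, `adapter`, `alignment_score`) and the update rule are illustrative assumptions, not the authors' actual API.

```python
import torch

def self_correct_latents(latents, text_emb, t2i_unet, adapter, t, step_size=0.1):
    """Sketch: nudge denoising latents toward better text-image alignment
    using gradients of a discriminative alignment score.
    All module names and signatures here are hypothetical."""
    latents = latents.detach().requires_grad_(True)
    # Run the T2I U-Net; reuse its output features as input to the
    # discriminative adapter (the paper probes the model's own features).
    features = t2i_unet(latents, t, encoder_hidden_states=text_emb).sample
    # Hypothetical adapter head scoring text-image alignment (higher = better).
    score = adapter.alignment_score(features, text_emb)
    # Discriminative gradient of the alignment score w.r.t. the latents.
    grad = torch.autograd.grad(score.sum(), latents)[0]
    # Gradient-ascent step on alignment; returned latents re-enter sampling.
    return (latents + step_size * grad).detach()
```

This mirrors classifier guidance in spirit: the discriminative adapter plays the role of the classifier, and its gradient steers sampling without retraining the generator.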