Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models
arXiv (2024)
Abstract
Text-to-image diffusion models have shown great success in generating
high-quality text-guided images. Yet, these models may still fail to
semantically align generated images with the provided text prompts, leading to
problems like incorrect attribute binding and/or catastrophic object neglect.
Given the pervasive object-oriented structure underlying text prompts, we
introduce a novel object-conditioned Energy-Based Attention Map Alignment
(EBAMA) method to address the aforementioned problems. We show that an
object-centric attribute binding loss naturally emerges by approximately
maximizing the log-likelihood of a z-parameterized energy-based model with
the help of the negative sampling technique. We further propose an
object-centric intensity regularizer to prevent excessive shifts of objects'
attention towards their attributes. Extensive qualitative and quantitative
experiments, including human evaluation, on several challenging benchmarks
demonstrate the superior performance of our method over previous strong
counterparts. With better-aligned attention maps, our approach shows great
promise in further enhancing the text-controlled image editing ability of
diffusion models.