ARTIST: Improving the Generation of Text-rich Images by Disentanglement
arXiv (2024)
Abstract
Diffusion models have demonstrated exceptional capabilities in generating a
broad spectrum of visual content, yet their proficiency in rendering text is
still limited: they often generate inaccurate characters or words that fail to
blend well with the underlying image. To address these shortcomings, we
introduce a new framework named ARTIST. This framework incorporates a dedicated
textual diffusion model to specifically focus on the learning of text
structures. Initially, we pretrain this textual model to capture the
intricacies of text representation. Subsequently, we finetune a visual
diffusion model, enabling it to assimilate textual structure information from
the pretrained textual model. This disentangled architecture design and the
training strategy significantly enhance the text rendering ability of the
diffusion models for text-rich image generation. Additionally, we leverage the
capabilities of pretrained large language models to better interpret user
intentions, contributing to improved generation quality. Empirical results on
the MARIO-Eval benchmark underscore the effectiveness of the proposed method,
showing an improvement of up to 15% in various metrics.
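The two-stage, disentangled training strategy described above can be sketched as follows. This is a minimal illustrative outline, not the paper's implementation: all class names, methods, and the toy "training" arithmetic are assumptions made for exposition, standing in for the actual diffusion-model pretraining and finetuning.

```python
# Illustrative sketch of a disentangled two-stage pipeline (hypothetical API):
# stage 1 pretrains a textual model on text structure, stage 2 freezes it and
# finetunes a visual model that consumes its features.

class TextualDiffusionModel:
    """Stage 1: learns text (glyph) structure; frozen afterwards."""

    def __init__(self):
        self.params = {"w": 0.0}
        self.frozen = False

    def pretrain(self, text_structure_data):
        # Placeholder "training": fit a trivial statistic of the data.
        self.params["w"] = sum(text_structure_data) / len(text_structure_data)

    def freeze(self):
        self.frozen = True

    def text_features(self, prompt):
        # Structural guidance handed to the visual model at generation time.
        return self.params["w"] * len(prompt)


class VisualDiffusionModel:
    """Stage 2: finetuned to assimilate the frozen textual model's features."""

    def __init__(self):
        self.params = {"v": 0.0}

    def finetune(self, images, textual_model):
        # The textual model stays fixed while the visual model is finetuned.
        assert textual_model.frozen, "textual model must be frozen in stage 2"
        self.params["v"] = sum(images) / len(images) + textual_model.params["w"]

    def generate(self, prompt, textual_model):
        return self.params["v"] + textual_model.text_features(prompt)


# Two-stage pipeline: pretrain textual model, freeze it, then finetune visual.
textual = TextualDiffusionModel()
textual.pretrain([1.0, 2.0, 3.0])      # stage 1
textual.freeze()

visual = VisualDiffusionModel()
visual.finetune([4.0, 6.0], textual)   # stage 2
score = visual.generate("hello", textual)
```

The point of the structure is only that the textual model's parameters do not change during stage 2; the visual model learns to consume its fixed outputs.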