User-Friendly Customized Generation with Multi-Modal Prompts
CoRR (2024)

Abstract
Text-to-image generation models have seen considerable advancement, catering
to the increasing interest in personalized image creation. Current
customization techniques often require users to provide multiple images
(typically 3-5) for each customized object, along with the class of each
object and a descriptive textual prompt for the scene. This paper asks
whether the process can be made more user-friendly and the customization
itself more intricate. We propose a method in which users need only provide
text together with images for each customization concept, requiring just a
single image per visual concept. We introduce the ``multi-modal prompt'', a
novel integration of text and images tailored to each customization concept,
which simplifies user interaction and enables precise customization of both
objects and scenes. Our proposed paradigm for customized text-to-image
generation surpasses existing finetuning-based methods in user-friendliness
and in its ability to customize complex objects from such simple inputs. Our
code is available at
https://github.com/zhongzero/Multi-Modal-Prompt.
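The abstract describes the multi-modal prompt only at a high level: an interleaved sequence of scene text and per-concept (image, text) pairs, with a single reference image per visual concept. The sketch below is a minimal illustration of what such an input structure might look like; all names (`PromptSegment`, `build_multimodal_prompt`, the example file paths) are hypothetical assumptions for illustration and are not the authors' API.

```python
# Hypothetical sketch of a "multi-modal prompt": interleaved scene text and
# per-concept (image, text) pairs, one reference image per visual concept.
# Names and structure are assumptions for illustration only.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PromptSegment:
    """One piece of a multi-modal prompt: plain text, optionally tied to a reference image."""
    text: str                         # descriptive text for a scene fragment or concept
    image_path: Optional[str] = None  # single reference image if this segment names a custom concept


def build_multimodal_prompt(segments: List[PromptSegment]) -> str:
    """Flatten segments into a readable prompt string, marking image-backed concepts."""
    parts = []
    for seg in segments:
        if seg.image_path:
            parts.append(f"<concept: {seg.text} | image: {seg.image_path}>")
        else:
            parts.append(seg.text)
    return " ".join(parts)


if __name__ == "__main__":
    # Example: two customized concepts, each backed by exactly one image.
    prompt = build_multimodal_prompt([
        PromptSegment("a photo of"),
        PromptSegment("my dog", image_path="dog.jpg"),
        PromptSegment("sitting in"),
        PromptSegment("this garden", image_path="garden.jpg"),
    ])
    print(prompt)
```

In this reading, the user never supplies object class labels or multiple views per object; the single image attached to each concept segment is the only visual input, which is what the paper frames as the gain in user-friendliness over finetuning-based approaches.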