MFECLIP: CLIP With Mapping-Fusion Embedding for Text-Guided Image Editing

Fei Wu, Yongheng Ma, Hao Jin, Xiao-Yuan Jing, Guo-Ping Jiang

IEEE Signal Processing Letters (2024)

Abstract
Recently, generative adversarial networks (GANs) have made remarkable progress, particularly with the advent of Contrastive Language-Image Pretraining (CLIP), which maps images and text into a joint latent space, bridging the gap between the two modalities. Several impressive text-guided image editing methods based on GANs and CLIP have emerged. However, most of these studies simply minimize the distance between the target image embedding and the text embedding in the CLIP space and take this objective as the network's optimization goal, overlooking that the actual distance between the two embeddings may remain large. This can prevent the text prompt from accurately guiding the editing process and can cause changes in text-irrelevant attributes. To mitigate this issue, we propose a novel approach named CLIP with Mapping-Fusion Embedding (MFECLIP) for text-guided image editing, which comprises two components: the MFE Block and the MFE Loss. Through the MFE Block, we obtain a Mapping-Fusion Embedding (MFE) that further reduces the modality gap and serves as a superior guide for the editing process in place of the original text embedding. The MFE Loss, based on contrastive learning, is designed to achieve accurate alignment between the target image and the text prompt. Extensive experiments on the real-world CUB and Oxford datasets demonstrate the favorable performance of the proposed method.
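The abstract describes the two components only at a high level. A minimal sketch of how a mapping-fusion block and a contrastive alignment loss might look is given below in PyTorch; the layer sizes, the learnable convex fusion rule, the InfoNCE-style loss form, and all names here are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFEBlock(nn.Module):
    """Hypothetical Mapping-Fusion Embedding block.

    Maps the CLIP text embedding toward the image-embedding domain and
    fuses it with the source image embedding; the fused vector then
    guides editing instead of the raw text embedding.
    """
    def __init__(self, dim: int = 512):
        super().__init__()
        # Small MLP that maps the text embedding into the image domain
        # (assumed architecture).
        self.mapper = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Learnable fusion weight between mapped text and image embeddings.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, text_emb: torch.Tensor, img_emb: torch.Tensor) -> torch.Tensor:
        mapped = self.mapper(text_emb)
        fused = self.alpha * mapped + (1.0 - self.alpha) * img_emb
        return F.normalize(fused, dim=-1)


def mfe_contrastive_loss(edited_img_emb: torch.Tensor,
                         fused_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pairing each edited image with its fused embedding.

    A standard contrastive formulation; the paper's actual MFE Loss may differ.
    """
    # Cosine-similarity logits between all edited-image / fused-embedding pairs.
    logits = F.normalize(edited_img_emb, dim=-1) @ F.normalize(fused_emb, dim=-1).T
    # Matching pairs lie on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits / temperature, labels)
```

A learnable convex combination is only one simple way to fuse the two embeddings; the paper's block may use a different mapping network or fusion strategy, but any such fused vector already lies closer to the image manifold than the raw text embedding, which is the stated motivation.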
Key words
Semantics, Generative adversarial networks, Training, Task analysis, Flowering plants, Birds, Telecommunications, Text-guided image editing, GAN, CLIP