Iteratively Prompting Multimodal LLMs to Reproduce Natural and AI-Generated Images
CoRR (2024)
Abstract
With the digital imagery landscape rapidly evolving, image stocks and
AI-generated image marketplaces have become central to visual media.
Traditional stock images now exist alongside innovative platforms that trade in
prompts for AI-generated visuals, driven by sophisticated APIs like DALL-E 3
and Midjourney. This paper studies the possibility of employing multimodal
models with enhanced visual understanding to mimic the outputs of these
platforms, introducing an original attack strategy. Our method leverages
fine-tuned CLIP models, a multi-label classifier, and the descriptive
capabilities of GPT-4V to create prompts that generate images similar to those
available in marketplaces and from premium stock image providers, yet at a
markedly lower expense. In presenting this strategy, we aim to spotlight a new
class of economic and security considerations within the realm of digital
imagery. Our findings, supported by both automated metrics and human
assessment, reveal that comparable visual content can be produced for a
fraction of the prevailing market prices ($0.23 to $0.27 per image), emphasizing
the need for awareness and strategic discussions about the integrity of digital
media in an increasingly AI-integrated landscape. Our work also contributes to
the field by assembling a dataset consisting of approximately 19 million
prompt-image pairs generated by the popular Midjourney platform, which we plan
to release publicly.
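The attack described above can be read as an iterative refinement loop: propose a prompt, generate an image, score it against the target, and refine the prompt until the score stops improving. The sketch below is our own minimal reconstruction of that loop, not the authors' code; the `generate`, `similarity`, and `refine` callables are hypothetical stand-ins for the image API (DALL-E 3 / Midjourney), the fine-tuned CLIP scorer, and GPT-4V's descriptive feedback, respectively. The toy stubs at the bottom exist only to make the loop runnable without any external service.

```python
from typing import Callable

def reproduce_image(
    target: str,                               # stand-in for the target image
    initial_prompt: str,
    generate: Callable[[str], str],            # prompt -> generated image
    similarity: Callable[[str, str], float],   # (target, candidate) -> score
    refine: Callable[[str, str], str],         # (prompt, candidate) -> new prompt
    max_iters: int = 5,
) -> tuple[str, float]:
    """Greedy loop: keep the prompt whose output best matches the target."""
    best_prompt, best_score = initial_prompt, float("-inf")
    prompt = initial_prompt
    for _ in range(max_iters):
        candidate = generate(prompt)
        score = similarity(target, candidate)
        if score > best_score:
            best_prompt, best_score = prompt, score
        prompt = refine(prompt, candidate)
    return best_prompt, best_score

# Toy stand-ins so the loop runs end to end (no real APIs involved):
TARGET = "red fox in snow"

def toy_generate(prompt: str) -> str:
    return prompt  # pretend the "image" is just its prompt

def toy_similarity(target: str, candidate: str) -> float:
    t, c = set(target.split()), set(candidate.split())
    return len(t & c) / len(t)  # word overlap as a stand-in for CLIP score

def toy_refine(prompt: str, candidate: str) -> str:
    have = set(prompt.split())
    missing = [w for w in TARGET.split() if w not in have]
    return prompt + " " + missing[0] if missing else prompt

best_prompt, best_score = reproduce_image(
    TARGET, "fox", toy_generate, toy_similarity, toy_refine
)
```

With these stubs the loop greedily recovers all of the target's words within the iteration budget; in the paper's setting the same structure would instead drive paid API calls, which is why the per-image cost stays bounded by `max_iters`.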