Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models
CoRR (2024)
Abstract
Diffusion models (DMs) have achieved remarkable success in text-to-image
generation, but they also pose safety risks, such as the potential generation
of harmful content and copyright violations. The techniques of machine
unlearning, also known as concept erasing, have been developed to address these
risks. However, these techniques remain vulnerable to adversarial prompt
attacks, which can prompt DMs post-unlearning to regenerate undesired images
containing concepts (such as nudity) meant to be erased. This work aims to
enhance the robustness of concept erasing by integrating the principle of
adversarial training (AT) into machine unlearning, resulting in the robust
unlearning framework referred to as AdvUnlearn. However, achieving this
effectively and efficiently is highly nontrivial. First, we find that a
straightforward implementation of AT compromises DMs' image generation quality
post-unlearning. To address this, we develop a utility-retaining regularization
on an additional retain set, optimizing the trade-off between concept erasure
robustness and model utility in AdvUnlearn. Moreover, we identify the text
encoder as a more suitable module for robustification than the UNet,
preserving unlearning effectiveness. The resulting text encoder can then
serve as a plug-and-play robust unlearner for various DM types. Empirically, we perform
extensive experiments to demonstrate the robustness advantage of AdvUnlearn
across various DM unlearning scenarios, including the erasure of nudity,
objects, and style concepts. In addition to robustness, AdvUnlearn also
achieves a balanced trade-off with model utility. To our knowledge, this is the
first work to systematically explore robust DM unlearning through AT, setting
it apart from existing methods that overlook robustness in concept erasing.
Code is available at: https://github.com/OPTML-Group/AdvUnlearn
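
As a reading aid, below is a minimal sketch of the training structure the abstract describes: an inner attack searches for an adversarial prompt that coaxes the unlearned model into regenerating the erased concept, and an outer step updates the text encoder (rather than the UNet) to unlearn against that worst-case prompt, with a utility-retaining regularizer on a retain set balancing erasure robustness against model utility. All names here (advunlearn_step, unlearn_loss, attack_loss, retain_loss, gamma, atk_steps) are hypothetical stand-ins, not the repository's API; the real losses would be built from the diffusion model's noise-prediction objective, and for simplicity the attack perturbs the continuous prompt embedding rather than discrete prompt tokens.

import torch

# Sketch: one AdvUnlearn-style training step (all names are illustrative).
def advunlearn_step(text_encoder, erase_emb, retain_embs,
                    unlearn_loss, attack_loss, retain_loss,
                    optimizer, gamma=0.5, atk_steps=10, atk_lr=1e-2):
    # Inner attack: find a prompt-embedding perturbation that best
    # resurrects the erased concept (a continuous relaxation of a
    # discrete adversarial prompt attack).
    delta = torch.zeros_like(erase_emb, requires_grad=True)
    for _ in range(atk_steps):
        adv = attack_loss(text_encoder, erase_emb + delta)
        (grad,) = torch.autograd.grad(adv, delta)
        with torch.no_grad():
            delta -= atk_lr * grad.sign()  # signed-gradient descent on the attack loss
    adv_emb = (erase_emb + delta).detach()

    # Outer update: unlearn under the worst-case prompt, while a
    # utility-retaining regularizer anchors behavior on the retain set.
    optimizer.zero_grad()
    loss = (unlearn_loss(text_encoder, adv_emb)
            + gamma * retain_loss(text_encoder, retain_embs))
    loss.backward()
    optimizer.step()
    return float(loss)

# Toy stand-ins so the sketch runs end to end; real losses would come
# from the diffusion denoising objective on generated images.
if __name__ == "__main__":
    enc = torch.nn.Linear(16, 16)                   # stand-in text encoder
    opt = torch.optim.Adam(enc.parameters(), lr=1e-4)
    u = lambda e, x: e(x).pow(2).mean()             # push erased-concept output toward null
    a = lambda e, x: -e(x).pow(2).mean()            # attacker pulls the concept back
    r = lambda e, xs: (e(xs) - xs).pow(2).mean()    # stay close to original behavior on retain prompts
    erase_emb = torch.randn(4, 16)
    retain_embs = torch.randn(32, 16)
    for step in range(3):
        print(advunlearn_step(enc, erase_emb, retain_embs, u, a, r, opt))

The gamma knob in this sketch mirrors the trade-off the abstract emphasizes: larger values weight the retain set and preserve generation quality, smaller values weight erasure robustness under adversarial prompts.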