MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance
CoRR(2024)
摘要
The deployment of multimodal large language models (MLLMs) has brought forth
a unique vulnerability: susceptibility to malicious attacks through visual
inputs. We delve into the novel challenge of defending MLLMs against such
attacks. We discovered that images act as a "foreign language" that is not
considered during alignment, which can make MLLMs prone to producing harmful
responses. Unfortunately, unlike the discrete tokens considered in text-based
LLMs, the continuous nature of image signals presents significant alignment
challenges, which poses difficulty to thoroughly cover the possible scenarios.
This vulnerability is exacerbated by the fact that open-source MLLMs are
predominantly fine-tuned on limited image-text pairs that is much less than the
extensive text-based pretraining corpus, which makes the MLLMs more prone to
catastrophic forgetting of their original abilities during explicit alignment
tuning. To tackle these challenges, we introduce MLLM-Protector, a
plug-and-play strategy combining a lightweight harm detector and a response
detoxifier. The harm detector's role is to identify potentially harmful outputs
from the MLLM, while the detoxifier corrects these outputs to ensure the
response stipulates to the safety standards. This approach effectively
mitigates the risks posed by malicious visual inputs without compromising the
model's overall performance. Our results demonstrate that MLLM-Protector offers
a robust solution to a previously unaddressed aspect of MLLM security.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要