Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge
CoRR (2024)
Abstract
The democratization of pre-trained language models through open-source
initiatives has rapidly advanced innovation and expanded access to cutting-edge
technologies. However, this openness also brings significant security risks,
including backdoor attacks, where hidden malicious behaviors are triggered by
specific inputs, compromising natural language processing (NLP) system
integrity and reliability. This paper suggests that merging a backdoored model
with other homogeneous models can remediate backdoor vulnerabilities even if
such models are not entirely secure. In our experiments, we explore various
models (BERT-Base, RoBERTa-Large, Llama2-7B, and Mistral-7B) and datasets
(SST-2, OLID, AG News, and QNLI). Compared to multiple advanced defensive
approaches, our method offers an effective and efficient inference-stage
defense against backdoor attacks without additional resources or specific
knowledge. Our approach consistently outperforms the other advanced baselines,
leading to an average of 75% reduction in the attack success rate. Since model
merging has been an established approach for improving model performance, the
extra advantage it provides regarding defense can be seen as a cost-free bonus.
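
To make the idea of merging homogeneous models concrete, the following is a minimal sketch that assumes the simplest possible strategy, uniform parameter averaging. The abstract does not say which merging method the paper uses, so this should not be read as the authors' procedure. The function name `merge_by_averaging` and the toy parameter dictionaries are hypothetical, and plain Python lists stand in for real weight tensors.

```python
from typing import Dict, List

# A toy stand-in for a model checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def merge_by_averaging(models: List[StateDict]) -> StateDict:
    """Uniformly average the parameters of models that share one architecture.

    Illustrative assumption only: the paper may use a different merging rule.
    """
    if not models:
        raise ValueError("need at least one model to merge")

    reference = models[0]
    for other in models[1:]:
        # "Homogeneous" here means identical parameter names and shapes.
        if other.keys() != reference.keys():
            raise ValueError("models must have the same parameter names")
        for name in reference:
            if len(other[name]) != len(reference[name]):
                raise ValueError(f"shape mismatch for parameter {name!r}")

    merged: StateDict = {}
    for name in reference:
        # Group the i-th weight of every model together, then take the mean.
        per_weight = zip(*(model[name] for model in models))
        merged[name] = [sum(values) / len(models) for values in per_weight]
    return merged


# Hypothetical example: one weight carries a large, backdoor-like value in a
# single model; averaging with another model pulls it toward the shared value.
backdoored = {"classifier.weight": [1.0, 4.0]}
other_model = {"classifier.weight": [1.0, 0.0]}
merged = merge_by_averaging([backdoored, other_model])
```

In this toy setting the weight that differs between the two models is halved, while the weight they agree on is unchanged. That is the intuition behind diluting a backdoor through merging, under the stated averaging assumption.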