No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
CoRR (2024)
Abstract
The existing safety alignment of Large Language Models (LLMs) has been found to be
fragile and can be easily broken through different strategies, such as
fine-tuning on a few harmful examples or manipulating the prefix of the
generation results. However, the attack mechanisms of these strategies remain
underexplored. In this paper, we ask the following question:
while these approaches can all significantly compromise safety, do
their attack mechanisms exhibit strong similarities? To answer this question,
we break down the safeguarding process of an LLM when it encounters harmful
instructions into three stages: (1) recognizing harmful instructions, (2)
generating an initial refusing tone, and (3) completing the refusal response.
Accordingly, we investigate whether and how different attack strategies could
influence each stage of this safeguarding process. We utilize techniques such
as logit lens and activation patching to identify model components that drive
specific behavior, and we apply cross-model probing to examine representation
shifts after an attack. In particular, we analyze the two most representative
types of attack approaches: Explicit Harmful Attack (EHA) and Identity-Shifting
Attack (ISA). Surprisingly, we find that their attack mechanisms diverge
dramatically. Unlike ISA, EHA tends to aggressively target the harmful-instruction
recognition stage. While both EHA and ISA disrupt the latter two stages, the
extent and mechanisms of their attacks differ significantly. Our findings
underscore the importance of understanding LLMs' internal safeguarding process
and suggest that diverse defense mechanisms are required to effectively cope
with various types of attacks.
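
As a rough illustration of the logit lens technique mentioned in the abstract, the sketch below projects the intermediate hidden state of each layer through an unembedding matrix to see which tokens that layer already favors. It is not the authors' code: the random weights stand in for a real model, and the names `logit_lens`, `refusal_score`, and the refusal token ids are hypothetical choices made only for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each hidden vector to zero mean and unit variance
    # (no learned scale or bias in this toy version).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens(hidden_states, unembed):
    # hidden_states: (n_layers, d_model), one residual-stream vector per layer
    #                taken at the same token position.
    # unembed:       (d_model, vocab), the output projection matrix.
    # Returns (n_layers, vocab) token probabilities decoded from each layer.
    return softmax(layer_norm(hidden_states) @ unembed)

def refusal_score(probs, refusal_ids):
    # Probability mass each layer assigns to a (hypothetical) set of
    # refusal-tone token ids, e.g. ids for "Sorry" or "I".
    return probs[:, refusal_ids].sum(axis=-1)

# Toy stand-ins for a real model (assumption: random weights, tiny sizes).
rng = np.random.default_rng(0)
n_layers, d_model, vocab = 4, 8, 16
hidden = rng.normal(size=(n_layers, d_model))
W_U = rng.normal(size=(d_model, vocab))

probs = logit_lens(hidden, W_U)
scores = refusal_score(probs, [3, 7])  # hypothetical refusal token ids
for layer, s in enumerate(scores):
    print(f"layer {layer}: refusal mass = {s:.3f}")
```

With a real model, comparing such per-layer scores before and after an attack would be one way to see at which depth the initial refusing tone stops forming; the paper's actual procedure may differ.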