Large Language Models Relearn Removed Concepts
CoRR (2024)
Abstract
Advances in model editing through neuron pruning hold promise for removing
undesirable concepts from large language models. However, it remains unclear
whether models have the capacity to reacquire pruned concepts after editing. To
investigate this, we evaluate concept relearning in models by tracking concept
saliency and similarity in pruned neurons during retraining. Our findings
reveal that models can quickly regain performance post-pruning by relocating
advanced concepts to earlier layers and reallocating pruned concepts to primed
neurons with similar semantics. This demonstrates that models exhibit
polysemantic capacities and can blend old and new concepts in individual
neurons. While neuron pruning provides interpretability into model concepts,
our results highlight the challenges of permanent concept removal for improved
model safety. Monitoring concept reemergence and developing techniques
to mitigate relearning of unsafe concepts will be important directions for more
robust model editing. Overall, our work demonstrates the resilience
and fluidity of concept representations in LLMs after concept removal.
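The abstract describes tracking concept saliency and similarity in pruned neurons during retraining. As an illustration only, the sketch below shows one plausible way such metrics could be computed: a saliency proxy (mean activation on concept tokens minus mean activation elsewhere) and a cosine similarity between a pruned neuron's weight vector and a candidate "primed" neuron's. The function names and exact formulas are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def concept_saliency(neuron_acts, concept_mask):
    """Saliency proxy for one neuron: mean activation on tokens
    tagged with the concept minus mean activation on all other
    tokens. (Assumed metric; the paper's definition may differ.)"""
    mask = np.asarray(concept_mask, dtype=bool)
    acts = np.asarray(neuron_acts, dtype=float)
    return acts[mask].mean() - acts[~mask].mean()

def concept_similarity(w_pruned, w_candidate):
    """Cosine similarity between a pruned neuron's weight vector
    and a candidate neuron's, usable to flag neurons with similar
    semantics that might absorb the pruned concept."""
    w_pruned = np.asarray(w_pruned, dtype=float)
    w_candidate = np.asarray(w_candidate, dtype=float)
    return float(np.dot(w_pruned, w_candidate)
                 / (np.linalg.norm(w_pruned) * np.linalg.norm(w_candidate)))
```

During retraining, one could log these two quantities per checkpoint for the pruned neurons and their nearest neighbors, and watch for saliency recovering in neurons whose similarity to the pruned ones was already high, which is the relearning pattern the abstract reports.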