Single Word Change is All You Need: Designing Attacks and Defenses for Text Classifiers
CoRR (2024)
Abstract
In text classification, creating an adversarial example means subtly
perturbing a few words in a sentence without changing its meaning, causing it
to be misclassified by a classifier. A concerning observation is that a
significant portion of adversarial examples generated by existing methods
change only one word. This single-word perturbation vulnerability represents a
significant weakness in classifiers, which malicious users can exploit to
efficiently create a multitude of adversarial examples. This paper studies this
problem and makes the following key contributions: (1) We introduce a novel
metric ρ to quantitatively assess a classifier's robustness against
single-word perturbation. (2) We present SP-Attack, which is designed to exploit the
single-word perturbation vulnerability; compared to state-of-the-art adversarial
methods, it achieves a higher attack success rate and better preserves sentence
meaning while reducing computation costs. (3) We propose SP-Defense, which aims
to improve ρ by applying data augmentation in learning. Experimental
results on 4 datasets and BERT and distilBERT classifiers show that SP-Defense
improves ρ by 14.6% [...]
SP-Attack by 30.4% [...]
attack success rate of existing attack methods that involve multiple-word
perturbations.
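
To make the notion of a single-word perturbation concrete, here is a minimal Python sketch of an exhaustive single-word substitution search. It is not the paper's SP-Attack, and the fraction it computes is not the paper's definition of its robustness metric, which the abstract does not give. The keyword-based `toy_classifier`, the `SUBSTITUTES` table (assumed to hold meaning-preserving replacements), and all function names are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for a real text classifier: predicts 1 (positive)
# when positive cue words outnumber negative ones, otherwise 0.
POSITIVE = {"good", "great", "enjoyable"}
NEGATIVE = {"bad", "dull", "awful"}

# Hypothetical table of substitutions assumed to preserve sentence meaning.
SUBSTITUTES: Dict[str, List[str]] = {
    "good": ["fine", "decent"],
    "great": ["grand"],
}


def toy_classifier(tokens: List[str]) -> int:
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return 1 if score > 0 else 0


def single_word_attack(
    tokens: List[str],
    classify: Callable[[List[str]], int],
    substitutes: Dict[str, List[str]],
) -> Optional[Tuple[int, str]]:
    """Return (position, replacement) for the first single-word substitution
    that changes the prediction, or None if no such substitution exists."""
    original = classify(tokens)
    for i, tok in enumerate(tokens):
        for cand in substitutes.get(tok, []):
            perturbed = tokens[:i] + [cand] + tokens[i + 1:]
            if classify(perturbed) != original:
                return i, cand
    return None


def fraction_robust(
    sentences: List[List[str]],
    classify: Callable[[List[str]], int],
    substitutes: Dict[str, List[str]],
) -> float:
    """Fraction of sentences whose prediction survives every listed
    single-word substitution (an illustrative quantity only)."""
    robust = sum(
        single_word_attack(s, classify, substitutes) is None for s in sentences
    )
    return robust / len(sentences)


vulnerable = ["the", "film", "was", "good"]
sturdy = ["a", "great", "and", "enjoyable", "film"]
print(single_word_attack(vulnerable, toy_classifier, SUBSTITUTES))
print(fraction_robust([vulnerable, sturdy], toy_classifier, SUBSTITUTES))
```

In this toy setting the first sentence depends on a single cue word, so replacing it flips the prediction, while the second sentence carries two cue words and survives every listed substitution. A real evaluation would replace the toy classifier with a trained model and the table with a vetted source of meaning-preserving substitutes.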