Single Word Change is All You Need: Designing Attacks and Defenses for Text Classifiers

Lei Xu, Sarah Alnegheimish, Laure Berti-Equille, Alfredo Cuesta-Infante, Kalyan Veeramachaneni

CoRR (2024)

Abstract

In text classification, creating an adversarial example means subtly perturbing a few words in a sentence without changing its meaning, so that a classifier misclassifies it. A concerning observation is that a significant portion of adversarial examples generated by existing methods change only one word. This single-word perturbation vulnerability represents a significant weakness in classifiers, which malicious users can exploit to efficiently create a multitude of adversarial examples. This paper studies this problem and makes the following key contributions: (1) We introduce a novel metric ρ to quantitatively assess a classifier's robustness against single-word perturbation. (2) We present SP-Attack, designed to exploit the single-word perturbation vulnerability, which achieves a higher attack success rate and better preserves sentence meaning while reducing computation costs compared to state-of-the-art adversarial methods. (3) We propose SP-Defense, which aims to improve ρ by applying data augmentation in learning. Experimental results on 4 datasets and BERT and distilBERT classifiers show that SP-Defense improves ρ by 14.6%, decreases the attack success rate of SP-Attack by 30.4%, and decreases the attack success rate of existing attack methods that involve multiple-word perturbations.
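
The abstract does not spell out how the robustness metric is computed, so the following is only a minimal sketch under stated assumptions, not the paper's SP-Attack or its metric. It assumes one plausible reading: a sentence counts as robust if no single-word substitution drawn from a candidate vocabulary changes the predicted label, and the score is the fraction of robust sentences. The function names, the toy classifier, and the vocabulary are all hypothetical, and the sketch does not check that a substitution preserves meaning, which a real attack would have to do.

```python
from typing import Callable, Iterable, Iterator, List

Tokens = List[str]
Classifier = Callable[[Tokens], int]


def single_word_variants(sentence: Tokens, vocab: Iterable[str]) -> Iterator[Tokens]:
    """Yield every sentence that differs from `sentence` in exactly one word."""
    candidates = list(vocab)
    for i, original in enumerate(sentence):
        for word in candidates:
            if word != original:
                yield sentence[:i] + [word] + sentence[i + 1:]


def is_single_word_robust(classify: Classifier, sentence: Tokens,
                          vocab: Iterable[str]) -> bool:
    """True if no single-word substitution changes the predicted label."""
    label = classify(sentence)
    return all(classify(v) == label
               for v in single_word_variants(sentence, vocab))


def robustness_fraction(classify: Classifier, sentences: Iterable[Tokens],
                        vocab: Iterable[str]) -> float:
    """Fraction of sentences that survive every single-word substitution.

    Assumption: this is an illustrative stand-in for a robustness score,
    not the definition used in the paper.
    """
    sentences = list(sentences)
    candidates = list(vocab)
    if not sentences:
        raise ValueError("need at least one sentence")
    robust = sum(is_single_word_robust(classify, s, candidates)
                 for s in sentences)
    return robust / len(sentences)


def toy_classifier(tokens: Tokens) -> int:
    """Hypothetical classifier: label 1 if the word 'good' appears, else 0."""
    return int("good" in tokens)


vocab = ["good", "fine", "movie"]
sentences = [["a", "good", "movie"], ["good", "good", "good"]]
score = robustness_fraction(toy_classifier, sentences, vocab)
print(score)
```

In this toy setup the first sentence is vulnerable, because replacing its only "good" flips the label, while the second is not, because any single replacement still leaves a "good" in place. The exhaustive loop costs one classifier call per (position, candidate word) pair, which is why a practical attack would need a cheaper search over substitutions.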
