Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks
CoRR(2023)
Abstract
Real-world natural language processing systems need to be robust to human
adversaries. Collecting examples of human adversaries for training is an
effective but expensive solution. On the other hand, training on synthetic
attacks with small perturbations - such as word-substitution - does not
actually improve robustness to human adversaries. In this paper, we propose an
adversarial training framework that uses limited human adversarial examples to
generate more useful adversarial examples at scale. We demonstrate the
advantages of this system on the ANLI and hate speech detection benchmark
datasets - both collected via an iterative, adversarial
human-and-model-in-the-loop procedure. Compared to training only on observed
human attacks, also training on our synthetic adversarial examples improves
model robustness to future rounds. In ANLI, we see accuracy gains on the
current set of attacks (44.1% [...]) and [...]
human-generated attacks (32.5% [...]). In hate
speech detection, we see AUC gains on current attacks (0.76 → 0.84) and a
future round (0.77 → 0.79). Attacks from methods that do not learn the
distribution of existing human adversaries, meanwhile, degrade robustness.
Keywords
robustness, attacks, human-like