Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks

Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin Chen, Alex Beutel

CoRR (2023)

Abstract

Real-world natural language processing systems need to be robust to human adversaries. Collecting examples of human adversaries for training is an effective but expensive solution. On the other hand, training on synthetic attacks with small perturbations, such as word substitution, does not actually improve robustness to human adversaries. In this paper, we propose an adversarial training framework that uses limited human adversarial examples to generate more useful adversarial examples at scale. We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets, both collected via an iterative, adversarial human-and-model-in-the-loop procedure. Compared to training only on observed human attacks, also training on our synthetic adversarial examples improves model robustness to future rounds. In ANLI, we see accuracy gains on the current set of attacks (44.1 […]) […] human generated attacks (32.5 […]). In hate speech detection, we see AUC gains on current attacks (0.76 → 0.84) and a future round (0.77 → 0.79). Attacks from methods that do not learn the distribution of existing human adversaries, meanwhile, degrade robustness.

Keywords

robustness, attacks, human-like
