Inequality phenomenon in $l_{\infty}$-adversarial training, and its unrealized threats

ICLR 2023

Abstract
The appearance of adversarial examples has drawn attention from both academia and industry. Amid the attack-defense arms race, adversarial training remains the most effective defense against adversarial examples. However, we find that inequality phenomena occur during $l_{\infty}$-adversarial training: a few features dominate the prediction made by the adversarially trained model. We systematically evaluate these inequality phenomena through extensive experiments and find that they become more pronounced when adversarial training is performed with increasing adversarial strength (measured by $\epsilon$). We hypothesize that these inequality phenomena make the $l_{\infty}$-adversarially trained model less reliable than the standard-trained model when the few ``important features'' are influenced. To validate our hypothesis, we propose two simple attacks that either perturb or replace important features with noise or occlusion. Experiments show that an $l_{\infty}$-adversarially trained model can be easily attacked when these few important features are influenced. Our work sheds light on the limited practicality of $l_{\infty}$-adversarial training.
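To make the two attacks concrete, the following is a minimal PyTorch-style sketch of the idea the abstract describes: rank input features by gradient magnitude, then either add noise to the top-$k$ features or occlude them with a constant value. The saliency criterion, function names, and hyperparameters ($k$, sigma, the fill value) are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def important_feature_mask(model, x, y, k=100):
    """Binary mask over the k input features with the largest gradient magnitude.
    (Gradient saliency is an assumed importance criterion for illustration.)"""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    saliency = x.grad.abs().flatten(1)      # (batch, n_features)
    topk = saliency.topk(k, dim=1).indices  # indices of the k most salient features
    mask = torch.zeros_like(saliency)
    mask.scatter_(1, topk, 1.0)             # 1.0 at important features, 0.0 elsewhere
    return mask.view_as(x)

def noise_attack(model, x, y, k=100, sigma=0.5):
    """Perturb only the important features with Gaussian noise."""
    mask = important_feature_mask(model, x, y, k)
    return (x + sigma * torch.randn_like(x) * mask).clamp(0.0, 1.0)

def occlusion_attack(model, x, y, k=100, fill=0.0):
    """Replace the important features with a constant occlusion value."""
    mask = important_feature_mask(model, x, y, k)
    return x * (1.0 - mask) + fill * mask
```

Under these assumptions, comparing a model's accuracy on `noise_attack` and `occlusion_attack` inputs for an adversarially trained classifier versus a standard-trained one would mirror the comparison the abstract reports.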
Keywords
Adversarial training, Adversarial robustness, Adversarial feature representation