PASA: Attack Agnostic Unsupervised Adversarial Detection using Prediction Attribution Sensitivity Analysis
arXiv (2024)
Abstract
Deep neural networks for classification are vulnerable to adversarial
attacks, where small perturbations to input samples lead to incorrect
predictions. This susceptibility, combined with the black-box nature of such
networks, limits their adoption in critical applications like autonomous
driving. Feature-attribution-based explanation methods provide relevance of
input features for model predictions on input samples, thus explaining model
decisions. However, we observe that both model predictions and feature
attributions for input samples are sensitive to noise. We develop a practical
method that exploits this sensitivity of model predictions and feature attributions to
detect adversarial samples. Our method, PASA, computes two
test statistics from model predictions and feature attributions and can reliably
detect adversarial samples using thresholds learned from benign samples. We
validate our lightweight approach by evaluating the performance of PASA on
varying strengths of FGSM, PGD, BIM, and CW attacks on multiple image and
non-image datasets. On average, we outperform state-of-the-art statistical
unsupervised adversarial detectors on CIFAR-10 and ImageNet by 14% and 35%
ROC-AUC scores, respectively. Moreover, our approach demonstrates competitive
performance even when an adversary is aware of the defense mechanism.
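The abstract describes the mechanism only at a high level: perturb the input with noise and measure how much the prediction and the feature attribution change. A minimal sketch of that idea in PyTorch is given below, assuming a single-sample input batch and a caller-supplied attribution_fn (e.g. integrated gradients); noise_std and the thresholds t_pred and t_attr are placeholders for values that would be learned from benign data, not the paper's actual parameters.

import torch

def pasa_detect(model, attribution_fn, x, noise_std, t_pred, t_attr):
    # Hypothetical sketch of a PASA-style check for a single input x of
    # shape (1, C, H, W). attribution_fn(model, x) is assumed to return a
    # feature-attribution map of the same shape as x.
    model.eval()
    with torch.no_grad():
        logits = model(x)
    attr = attribution_fn(model, x)

    # Perturb the input with additive Gaussian noise and recompute
    # both the prediction and the attribution.
    noisy = x + noise_std * torch.randn_like(x)
    with torch.no_grad():
        logits_noisy = model(noisy)
    attr_noisy = attribution_fn(model, noisy)

    # Test statistic 1: sensitivity of the model prediction to noise.
    s_pred = (logits - logits_noisy).norm(p=2).item()
    # Test statistic 2: sensitivity of the feature attribution to noise.
    s_attr = (attr - attr_noisy).norm(p=2).item()

    # Flag the sample as adversarial if either statistic exceeds its
    # threshold learned from benign samples.
    return s_pred > t_pred or s_attr > t_attr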