Be Your Own Neighborhood: Detecting Adversarial Examples by the Neighborhood Relations Built on Self-Supervised Learning

Abstract

Deep Neural Networks (DNNs) have achieved excellent performance in various fields. However, DNNs' vulnerability to Adversarial Examples (AEs) hinders their deployment in safety-critical applications. In this paper, we present BEYOND, an AE detection framework designed for reliable predictions. BEYOND identifies AEs by distinguishing an AE's abnormal relation with its augmented versions, i.e. its neighbors, from two perspectives: representation similarity and label consistency. An off-the-shelf Self-Supervised Learning (SSL) model is used to extract the representation and predict the label, because its representations are more informative than those of supervised learning models. We found that clean samples maintain a high degree of representation similarity and label consistency relative to their neighbors, whereas AEs exhibit significant discrepancies. We explain this observation and show that, by leveraging this discrepancy, BEYOND can accurately detect AEs. Additionally, we develop a rigorous justification for the effectiveness of BEYOND. Furthermore, as a plug-and-play model, BEYOND can easily cooperate with an Adversarially Trained Classifier (ATC), achieving state-of-the-art (SOTA) robust accuracy. Experimental results show that BEYOND outperforms baselines by a large margin, especially under adaptive attacks. Owing to the robust relations built on SSL, BEYOND outperforms baselines in both detection ability and speed.

Neighborhood Relations of AEs and Clean Samples

Figure 1. Neighborhood Relations of AEs and Clean Samples.

Latent Neighborhood Graph (LNG) represents the relationship between the input sample and the reference samples as a graph whose nodes are embeddings extracted by a DNN and whose edges are built according to the distances between the input node and the reference nodes. It then trains a graph neural network on this graph to detect AEs.

Method Overview of BEYOND

Figure 2. Overview of BEYOND. First, we augment the input image to obtain a set of neighbors. Then, the label consistency mechanism compares the classifier's prediction for the input image with the labels that the SSL model's classification head predicts for the neighbors. Meanwhile, the representation similarity mechanism uses cosine distance to measure the similarity between the input image and its neighbors. Finally, an input image with poor label consistency or poor representation similarity is flagged as an AE.
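The decision rule described in Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: it assumes the neighbors' SSL labels and representations have already been computed, and the function name `flag_as_adversarial`, the similarity threshold `sim_threshold`, and the vote count `min_votes` are hypothetical placeholders.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_as_adversarial(
    classifier_label: int,
    input_repr: Vector,
    neighbor_labels: Sequence[int],
    neighbor_reprs: Sequence[Vector],
    sim_threshold: float = 0.8,  # hypothetical value, not from the paper
    min_votes: int = 3,          # hypothetical value, not from the paper
) -> bool:
    """Return True if the input looks adversarial.

    Label consistency: count neighbors whose SSL-predicted label agrees
    with the classifier's label for the input.
    Representation similarity: count neighbors whose representation is
    close (in cosine similarity) to the input's representation.
    The input is flagged if EITHER count is too low.
    """
    label_votes = sum(1 for lbl in neighbor_labels if lbl == classifier_label)
    sim_votes = sum(
        1 for r in neighbor_reprs
        if cosine_similarity(input_repr, r) >= sim_threshold
    )
    return label_votes < min_votes or sim_votes < min_votes
```

In this sketch a clean input whose neighbors mostly share its label and stay close in representation space passes both checks, while an input that fails either check is flagged.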

Detection Performance

**Table 1.** The Area Under the ROC Curve (AUC, %) of different adversarial detection approaches on CIFAR-10. LNG is not open-sourced, so its numbers are taken from its report. To align with baselines, the classifier is ResNet110, with ε = 0.05 for FGSM and ε = 0.02 for PGD. Note that BEYOND needs no AEs for training, so it has the same values in the seen and unseen settings. The **bold** values are the best performance, and the *italicized* values are the second-best performance.

Unseen: attacks used in training are excluded from tests. Seen: attacks used in training are included in tests.

| Method | Unseen FGSM | Unseen PGD | Unseen AutoAttack | Unseen Square | Seen FGSM | Seen PGD | Seen CW | Seen AutoAttack | Seen Square |
|---|---|---|---|---|---|---|---|---|---|
| DkNN | 61.55 | 51.22 | 52.12 | 59.46 | 61.55 | 51.22 | 61.52 | 52.12 | 59.46 |
| kNN | 61.83 | 54.52 | 52.67 | 73.39 | 61.83 | 54.52 | 62.23 | 52.67 | 73.39 |
| LID | 71.08 | 61.33 | 55.56 | 66.18 | 73.61 | 67.98 | 55.68 | 56.33 | 85.94 |
| Hu | 84.51 | 58.59 | 53.55 | *95.82* | 84.51 | 58.59 | *91.02* | 53.55 | 95.82 |
| Mao | 95.33 | *82.61* | *81.95* | 85.76 | 95.33 | 82.61 | 83.10 | 81.95 | 85.76 |
| LNG | *98.51* | 63.14 | 58.47 | 94.71 | **99.88** | *91.39* | 89.74 | *84.03* | *98.82* |
| BEYOND | **98.89** | **99.28** | **99.16** | **99.27** | *98.89* | **99.28** | **99.20** | **99.16** | **99.27** |
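For readers less familiar with the metric in Table 1, the AUC can be computed directly from detector scores as the probability that a randomly chosen AE receives a higher score than a randomly chosen clean sample, with ties counted as one half. The sketch below assumes that higher scores mean "more likely adversarial"; the function name `roc_auc` and the example scores are illustrative and not taken from the paper.

```python
from typing import Sequence


def roc_auc(scores_adv: Sequence[float], scores_clean: Sequence[float]) -> float:
    """AUC via pairwise comparison (the Mann-Whitney U statistic).

    Returns a value in [0, 1]; multiply by 100 to match the AUC(%) in Table 1.
    """
    if not scores_adv or not scores_clean:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for s_adv in scores_adv:
        for s_clean in scores_clean:
            if s_adv > s_clean:
                wins += 1.0   # AE correctly ranked above the clean sample
            elif s_adv == s_clean:
                wins += 0.5   # tie counts as half
    return wins / (len(scores_adv) * len(scores_clean))


# Illustrative scores only: 8 of the 9 (AE, clean) pairs are ranked correctly.
example_auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; rank-based implementations are preferable for large test sets.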

Adaptive Attack

Attackers can design adaptive attacks to try to bypass BEYOND when they know all the parameters of the model and the detection strategy. For an SSL model with a feature extractor $f$, a projector $h$, and a classification head $g$, the classification branch can be formulated as $\mathbb{C} = f\circ g$ and the representation branch as $\mathbb{R} = f\circ h$, where $f\circ g$ denotes applying $f$ first and then $g$. To attack effectively, the adversary must deceive the target model while preserving the label consistency and representation similarity of the SSL model.
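The two branches can be pictured as two pipelines that share the same feature extractor. The sketch below is a toy illustration only: `f`, `g`, and `h` are hypothetical stand-ins for the real networks, and `compose` is a helper introduced here to make the order of application explicit (the feature extractor runs first).

```python
from typing import Callable, List

Vector = List[float]


def compose(first: Callable, second: Callable) -> Callable:
    """Return a function that applies `first`, then `second`."""
    return lambda x: second(first(x))


# Toy stand-ins (assumptions, not the real SSL model).
def f(x: Vector) -> Vector:
    """Feature extractor: here, simply (sum, max) of the input."""
    return [sum(x), max(x)]


def g(z: Vector) -> int:
    """Classification head: index of the largest feature."""
    return max(range(len(z)), key=lambda i: z[i])


def h(z: Vector) -> Vector:
    """Projector: here, a fixed scaling of the features."""
    return [v / 2.0 for v in z]


classification_branch = compose(f, g)   # plays the role of C
representation_branch = compose(f, h)   # plays the role of R
```

An adaptive attacker has to control the outputs of both pipelines at once, which is why both share the perturbed input.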

Label Consistency Loss

$$ \mathrm{Loss}_{label} = \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\left(\mathbb{C}\left(W^i(x+\delta)\right), y_t\right) $$

where $k$ represents the number of generated neighbors, $W^i$ is the augmentation that produces the $i$-th neighbor, $\delta$ is the adversarial perturbation added to the input $x$, $y_t$ is the target class, and $\mathcal{L}$ is the cross-entropy loss function.
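The label consistency loss is an average of cross-entropy terms over the neighbors of the perturbed input. A minimal sketch, assuming the classification branch returns raw logits and the augmentations are given as plain callables; the names `label_consistency_loss`, `classify_logits`, and `augmentations` are hypothetical and chosen here for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of softmax(logits) against a single target class.

    Uses the log-sum-exp trick for numerical stability.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


def label_consistency_loss(
    x_adv: Vector,
    target: int,
    augmentations: Sequence[Callable[[Vector], Vector]],
    classify_logits: Callable[[Vector], Sequence[float]],
) -> float:
    """Mean cross-entropy between C(W^i(x + delta)) and y_t over k neighbors.

    `x_adv` plays the role of x + delta, `augmentations` the role of
    W^1..W^k, and `classify_logits` the role of the classification branch C.
    """
    if not augmentations:
        raise ValueError("at least one augmentation is required")
    losses = [
        cross_entropy(classify_logits(w(x_adv)), target) for w in augmentations
    ]
    return sum(losses) / len(losses)
```

A low value means the neighbors of the perturbed input are classified as the target class, which is the condition an adaptive attacker needs in order to pass the label consistency check.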

BibTeX

@article{he2024beyond,
  author    = {He, Zhiyuan and Yang, Yijun and Chen, Pin-Yu and Xu, Qiang and Ho, Tsung-Yi},
  title     = {Be Your Own Neighborhood: Detecting Adversarial Examples by the Neighborhood Relations Built on Self-Supervised Learning},
  journal   = {ICML},
  year      = {2024},
}