Update index.html
index.html (+9 −10)
@@ -61,7 +61,7 @@ Exploring Refusal Loss Landscapes </title>
 Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
 jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
 we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
-detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the 2-
+detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the 2-D Refusal Loss
 Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
 methods and show the defense performance against several Jailbreak attack methods.
 </p>
@@ -83,7 +83,7 @@ Exploring Refusal Loss Landscapes </title>
 interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
 sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to represent the probability with which
 the LLM won't reject the input user query. Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
-mean of the Jailbroken results (1
+mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. Using the approximation, we visualize the 2-D landscape of the Refusal Loss below:
 </p>
 
 <div class="container jailbreak-intro-sec">
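The paragraph completed in this hunk describes a Monte Carlo approximation of the refusal loss. A minimal sketch of that sample-mean estimate, assuming hypothetical generate_response and is_jailbroken helpers (e.g., a refusal-keyword matcher); neither helper is defined on this page:

def approximate_refusal_loss(query, generate_response, is_jailbroken, n_samples=8):
    # Query the target LLM n_samples times with the same prompt and average
    # the jailbroken indicator: 1 = the response complies (successful
    # jailbreak), 0 = the response refuses.
    results = [1 if is_jailbroken(generate_response(query)) else 0
               for _ in range(n_samples)]
    return sum(results) / n_samples  # estimate of f_theta(x)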
@@ -94,19 +94,18 @@ Exploring Refusal Loss Landscapes </title>
 From the above plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
 the <strong>Refusal Loss</strong> tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
 the gradient norm of <strong>Refusal Loss</strong> to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
-Below we present the definition of the <strong>Refusal Loss</strong> and the approximation of
-See more details about the concept, approximation, gradient estimation and landscape drawing techniques of it in our paper.
+Below we present the definition of the <strong>Refusal Loss</strong> and the approximation of its function value and gradient; see more details about them and the landscape drawing techniques in our paper.
 </p>
 
 <div id="refusal-loss-formula" class="container">
 <div id="refusal-loss-formula-list" class="row align-items-center formula-list">
-<a href="#
-<a href="#
-<a href="#
+<a href="#Refusal-Loss" class="selected">Refusal Loss Definition</a>
+<a href="#Refusal-Loss-Approximation">Refusal Loss Approximation</a>
+<a href="#Gradient-Estimation">Gradient Estimation</a>
 <div style="clear: both"></div>
 </div>
 <div id="refusal-loss-formula-content" class="row align-items-center">
-<span id="
+<span id="Refusal-Loss" class="formula" style="">
 $$
 \displaystyle
 \begin{aligned}
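The paragraph in this hunk specifies a two-stage detection rule. A schematic version, assuming refusal_loss is the sample-mean estimate sketched above and norm_threshold is a tunable parameter not specified on this page:

def gradient_cuff(query, refusal_loss, gradient_norm, norm_threshold):
    # Stage 1 (naive filter): the refusal loss approximates the probability
    # that the LLM won't reject the query, so a value under 0.5 means the
    # model leans toward refusing and the query is flagged as a jailbreak.
    if refusal_loss(query) < 0.5:
        return "reject"
    # Stage 2: malicious queries sit in a more precipitous loss landscape,
    # so a large gradient norm of the refusal loss also flags the query.
    if gradient_norm(query) > norm_threshold:
        return "reject"
    return "accept"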
@@ -118,7 +117,7 @@ Exploring Refusal Loss Landscapes </title>
 \end{aligned}
 $$
 </span>
-<span id="
+<span id="Refusal-Loss-Approximation" class="formula" style="display: none;">
 $$
 \displaystyle
 \begin{aligned}
@@ -130,7 +129,7 @@ Exploring Refusal Loss Landscapes </title>
 \end{aligned}
 $$
 </span>
-<span id="
+<span id="Gradient-Estimation" class="formula" style="display: none;">$$\displaystyle g_\theta(x)=\sum_{i=1}^P \frac{f_\theta(x\oplus \mu u_i)-f_\theta(x)}{\mu} u_i $$</span>
 </div>
 </div>
 
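The new #Gradient-Estimation span renders the zeroth-order estimator \(g_\theta(x)=\sum_{i=1}^P \frac{f_\theta(x\oplus \mu u_i)-f_\theta(x)}{\mu} u_i\). A sketch of that finite-difference formula, assuming the query is represented by a continuous embedding x perturbed along Gaussian directions (mu and P are illustrative values, not the page's settings):

import numpy as np

def estimate_gradient(f, x, mu=0.02, P=10, seed=0):
    # Zeroth-order estimate of the refusal-loss gradient:
    # g(x) = sum_i (f(x + mu * u_i) - f(x)) / mu * u_i, with random
    # directions u_i standing in for the formula's x (+) mu*u_i perturbation.
    rng = np.random.default_rng(seed)
    f_x = f(x)
    g = np.zeros_like(x)
    for _ in range(P):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f_x) / mu * u
    return g

# The stage-2 statistic is then the norm of this estimate, e.g.
# np.linalg.norm(estimate_gradient(refusal_loss_on_embedding, x)).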