Update index.html
index.html (+9 −10)
@@ -61,7 +61,7 @@ Exploring Refusal Loss Landscapes </title>
 Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
 jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
 we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
-detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the 2-
+detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the 2-D Refusal Loss
 Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
 methods and show the defense performance against several Jailbreak attack methods.
 </p>
@@ -83,7 +83,7 @@ Exploring Refusal Loss Landscapes </title>
 interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
 sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to represent the probability with which
 the LLM won't reject the input user query. Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
-mean of the Jailbroken results (1
+mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. Using the approximation, we visualize the 2-D landscape of the Refusal Loss below:
 </p>
 
 <div class="container jailbreak-intro-sec">
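The paragraph completed in this hunk describes a Monte Carlo approximation of the refusal loss. A minimal sketch of that sample-mean estimate, assuming hypothetical generate_response and is_jailbroken helpers (e.g., a refusal-keyword matcher); neither helper is defined on this page:

def approximate_refusal_loss(query, generate_response, is_jailbroken, n_samples=8):
    # Query the target LLM n_samples times with the same prompt and average
    # the jailbroken indicator: 1 = the response complies (successful
    # jailbreak), 0 = the response refuses.
    results = [1 if is_jailbroken(generate_response(query)) else 0
               for _ in range(n_samples)]
    return sum(results) / n_samples  # estimate of f_theta(x)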
@@ -94,19 +94,18 @@ Exploring Refusal Loss Landscapes </title>
 From the above plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
 the <strong>Refusal Loss</strong> tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
 the gradient norm of <strong>Refusal Loss</strong> to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
-Below we present the definition of the <strong>Refusal Loss</strong> and the approximation of
-See more details about the concept, approximation, gradient estimation and landscape drawing techniques of it in our paper.
+Below we present the definition of the <strong>Refusal Loss</strong> and the approximation of its function value and gradient; see more details about them and the landscape drawing techniques in our paper.
 </p>
 
 <div id="refusal-loss-formula" class="container">
 <div id="refusal-loss-formula-list" class="row align-items-center formula-list">
-<a href="#
-<a href="#
-<a href="#
+<a href="#Refusal-Loss" class="selected">Refusal Loss Definition</a>
+<a href="#Refusal-Loss-Approximation">Refusal Loss Approximation</a>
+<a href="#Gradient-Estimation">Gradient Estimation</a>
 <div style="clear: both"></div>
 </div>
 <div id="refusal-loss-formula-content" class="row align-items-center">
-<span id="
+<span id="Refusal-Loss" class="formula" style="">
 $$
 \displaystyle
 \begin{aligned}
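The paragraph in this hunk specifies a two-stage detection rule. A schematic version, assuming refusal_loss is the sample-mean estimate sketched above and norm_threshold is a tunable parameter not specified on this page:

def gradient_cuff(query, refusal_loss, gradient_norm, norm_threshold):
    # Stage 1 (naive filter): the refusal loss approximates the probability
    # that the LLM won't reject the query, so a value under 0.5 means the
    # model leans toward refusing and the query is flagged as a jailbreak.
    if refusal_loss(query) < 0.5:
        return "reject"
    # Stage 2: malicious queries sit in a more precipitous loss landscape,
    # so a large gradient norm of the refusal loss also flags the query.
    if gradient_norm(query) > norm_threshold:
        return "reject"
    return "accept"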
@@ -118,7 +117,7 @@ Exploring Refusal Loss Landscapes </title>
 \end{aligned}
 $$
 </span>
-<span id="
+<span id="Refusal-Loss-Approximation" class="formula" style="display: none;">
 $$
 \displaystyle
 \begin{aligned}
@@ -130,7 +129,7 @@ Exploring Refusal Loss Landscapes </title>
 \end{aligned}
 $$
 </span>
-<span id="
+<span id="Gradient-Estimation" class="formula" style="display: none;">$$\displaystyle g_\theta(x)=\sum_{i=1}^P \frac{f_\theta(x\oplus \mu u_i)-f_\theta(x)}{\mu} u_i $$</span>
 </div>
 </div>
 
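The new #Gradient-Estimation span renders the zeroth-order estimator \(g_\theta(x)=\sum_{i=1}^P \frac{f_\theta(x\oplus \mu u_i)-f_\theta(x)}{\mu} u_i\). A sketch of that finite-difference formula, assuming the query is represented by a continuous embedding x perturbed along Gaussian directions (mu and P are illustrative values, not the page's settings):

import numpy as np

def estimate_gradient(f, x, mu=0.02, P=10, seed=0):
    # Zeroth-order estimate of the refusal-loss gradient:
    # g(x) = sum_i (f(x + mu * u_i) - f(x)) / mu * u_i, with random
    # directions u_i standing in for the formula's x (+) mu*u_i perturbation.
    rng = np.random.default_rng(seed)
    f_x = f(x)
    g = np.zeros_like(x)
    for _ in range(P):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - f_x) / mu * u
    return g

# The stage-2 statistic is then the norm of this estimate, e.g.
# np.linalg.norm(estimate_gradient(refusal_loss_on_embedding, x)).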