gregH committed on
Commit
7310da4
·
verified ·
1 Parent(s): 546578d

Update index.html

Files changed (1)
  1. index.html +9 -10
index.html CHANGED
@@ -61,7 +61,7 @@ Exploring Refusal Loss Landscapes </title>
  Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
  jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
  we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
- detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the 2-d Refusal Loss
  Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
  methods and show the defense performance against several Jailbreak attack methods.
  </p>
@@ -83,7 +83,7 @@ Exploring Refusal Loss Landscapes </title>
  interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
  sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to represent the probability with which
  the LLM won't reject the input user query. Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
- mean of the Jailbroken results (1 denotes a successful jailbreak and 0 otherwise) to approximate the function value. Using the approximation, we visualize the 2-d landscape of the Refusal Loss below:
  </p>

  <div class="container jailbreak-intro-sec">
@@ -94,19 +94,18 @@ Exploring Refusal Loss Landscapes </title>
  From the above plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
  the <strong>Refusal Loss</strong> tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
  the gradient norm of <strong>Refusal Loss</strong> to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
- Below we present the definition of the <strong>Refusal Loss</strong> and the approximation of it's function value and gradient.
- See more details about the concept, approximation, gradient estimation and landscape drawing techniques of it in our paper.
  </p>

  <div id="refusal-loss-formula" class="container">
  <div id="refusal-loss-formula-list" class="row align-items-center formula-list">
- <a href="#ECE-formula" class="selected">Refusal Loss Definition</a>
- <a href="#SCE-formula">Refusal Loss Approximation</a>
- <a href="#ACE-formula">Gradient Estimation</a>
  <div style="clear: both"></div>
  </div>
  <div id="refusal-loss-formula-content" class="row align-items-center">
- <span id="ECE-formula" class="formula" style="">
  $$
  \displaystyle
  \begin{aligned}
@@ -118,7 +117,7 @@ Exploring Refusal Loss Landscapes </title>
  \end{aligned}
  $$
  </span>
- <span id="SCE-formula" class="formula" style="display: none;">
  $$
  \displaystyle
  \begin{aligned}
@@ -130,7 +129,7 @@ Exploring Refusal Loss Landscapes </title>
  \end{aligned}
  $$
  </span>
- <span id="ACE-formula" class="formula" style="display: none;">$$\displaystyle g_\theta(x)=\sum_{i=1}^P \frac{f_\theta(x\oplus \mu u_i)-f_\theta(x)}{\mu} u_i $$</span>
  </div>
  </div>
 
  Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
  jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
  we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
+ detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the 2-D Refusal Loss
  Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
  methods and show the defense performance against several Jailbreak attack methods.
  </p>
 
  interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
  sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to represent the probability with which
  the LLM won't reject the input user query. Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
+ mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. Using the approximation, we visualize the 2-D landscape of the Refusal Loss below:
  </p>

  <div class="container jailbreak-intro-sec">
 
  From the above plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
  the <strong>Refusal Loss</strong> tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
  the gradient norm of <strong>Refusal Loss</strong> to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
+ Below we present the definition of the <strong>Refusal Loss</strong> and the approximation of its function value and gradient. See more details about them and the landscape drawing techniques in our paper.
  </p>

  <div id="refusal-loss-formula" class="container">
  <div id="refusal-loss-formula-list" class="row align-items-center formula-list">
+ <a href="#Refusal-Loss" class="selected">Refusal Loss Definition</a>
+ <a href="#Refusal-Loss-Approximation">Refusal Loss Approximation</a>
+ <a href="#Gradient-Estimation">Gradient Estimation</a>
  <div style="clear: both"></div>
  </div>
  <div id="refusal-loss-formula-content" class="row align-items-center">
+ <span id="Refusal-Loss" class="formula" style="">
  $$
  \displaystyle
  \begin{aligned}
 
  \end{aligned}
  $$
  </span>
+ <span id="Refusal-Loss-Approximation" class="formula" style="display: none;">
  $$
  \displaystyle
  \begin{aligned}
 
  \end{aligned}
  $$
  </span>
+ <span id="Gradient-Estimation" class="formula" style="display: none;">$$\displaystyle g_\theta(x)=\sum_{i=1}^P \frac{f_\theta(x\oplus \mu u_i)-f_\theta(x)}{\mu} u_i $$</span>
  </div>
  </div>
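
Note on the Gradient Estimation formula in the last hunk: the estimator sums finite-difference slopes of f_θ along P directions u_i, and the page text approximates the function value by the sample mean of 0/1 outcomes from repeated queries. The sketch below illustrates both steps in Python. It is not code from this repository. It assumes x is a plain list of floats and treats ⊕ as element-wise addition, which may differ from how the paper perturbs a query. The names `sample_mean`, `estimate_gradient`, `l2_norm`, and `toy_f` are hypothetical, and `toy_f` is a deterministic stand-in for f_θ, not an LLM.

```python
def sample_mean(outcomes):
    """Approximate a function value as the mean of repeated 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)


def estimate_gradient(f, x, directions, mu):
    """Compute sum_i (f(x + mu * u_i) - f(x)) / mu * u_i.

    Assumption: x and each u_i are equal-length lists of floats, and the
    perturbation x (+) mu * u_i is plain element-wise addition.
    """
    base = f(x)
    grad = [0.0] * len(x)
    for u in directions:
        shifted = [xj + mu * uj for xj, uj in zip(x, u)]
        slope = (f(shifted) - base) / mu
        for j, uj in enumerate(u):
            grad[j] += slope * uj
    return grad


def l2_norm(v):
    """Euclidean norm, the quantity the page proposes to threshold."""
    return sum(c * c for c in v) ** 0.5


def toy_f(x):
    """Hypothetical linear stand-in for f_theta, used only for the demo."""
    return 2.0 * x[0] + 3.0 * x[1]


# Four repeated queries, two of them flagged 1: the sample mean is 0.5.
approx_value = sample_mean([1, 0, 0, 1])

# With the two coordinate axes as directions, a linear function's
# gradient is recovered exactly: [2.0, 3.0].
grad = estimate_gradient(toy_f, [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], mu=0.5)
grad_norm = l2_norm(grad)
```

In practice the directions u_i would be random vectors and f would itself be a noisy sample mean, so the estimate is only approximate. The coordinate-axis directions here are chosen so the result is easy to check by hand.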