gregH committed on
Commit aa27052 · verified · 1 Parent(s): 45edfad

Update index.html

Files changed (1)
  1. index.html +4 -2
index.html CHANGED
@@ -90,9 +90,11 @@ Exploring Refusal Loss Landscapes </title>
  </div>

  <p>
- In general, $$\phi_\theta(x) < 0.5$$
+ From the above plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
+ the <strong>Refusal Loss</strong> tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
+ the gradient norm of <strong>Refusal Loss</strong> to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value is under 0.5.
  Below we present the definition of the <strong>Refusal Loss</strong> and how we approximate it's function value and gradient.
- See more details about the concept, approximation and 2-d landscape drawing of it in our paper.
+ See more details about the concept, approximation, gradient estimation and landscape drawing of it in our paper.
  </p>

  <div id="refusal-loss-formula" class="container">
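
Note on the paragraph added in this commit: it describes a two-stage check, in which a query is first rejected when the Refusal Loss value is under 0.5 and is otherwise screened by the gradient norm of the Refusal Loss. The Python sketch below illustrates that flow under loud assumptions. The `loss` callable stands in for the Refusal Loss evaluated on a plain list of floats, the random-direction finite-difference estimator is a generic zeroth-order technique swapped in for whatever the paper uses, and `grad_threshold`, `num_dirs`, and `mu` are hypothetical values chosen only for illustration. Only the 0.5 value threshold comes from the page text.

```python
import math
import random
from typing import Callable, List, Sequence

Loss = Callable[[Sequence[float]], float]


def estimate_grad_norm(loss: Loss, x: Sequence[float],
                       num_dirs: int = 64, mu: float = 1e-3,
                       seed: int = 0) -> float:
    """Estimate ||grad loss(x)|| from function values only.

    Generic random-direction finite differences (an assumption,
    not necessarily the estimator used in the paper).
    """
    rng = random.Random(seed)
    base = loss(x)
    grad: List[float] = [0.0] * len(x)
    for _ in range(num_dirs):
        # Sample a Gaussian direction and probe the loss along it.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (loss(shifted) - base) / mu
        # Average slope * direction over all sampled directions.
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_dirs
    return math.sqrt(sum(g * g for g in grad))


def two_stage_detect(loss: Loss, x: Sequence[float],
                     value_threshold: float = 0.5,
                     grad_threshold: float = 1.0) -> str:
    """Stage 1: reject on low loss value. Stage 2: reject on large gradient norm."""
    if loss(x) < value_threshold:
        return "rejected: value"
    if estimate_grad_norm(loss, x) > grad_threshold:
        return "rejected: gradient"
    return "accepted"
```

With toy stand-ins, a constant loss of 0.9 passes both stages, a loss of 0.2 is rejected by the value check, and a steep loss such as `0.6 + 5 * x[0]` passes the value check but is flagged by the gradient-norm check.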