gregH committed on
Commit
fc5cf53
1 Parent(s): 4c6a875

Update index.html

  1. index.html +2 -2
index.html CHANGED
@@ -93,9 +93,9 @@ Exploring Refusal Loss Landscapes </title>
  <p>
  From the above plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
  the <strong>Refusal Loss</strong> tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
- the gradient norm of <strong>Refusal Loss</strong> to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value is under 0.5 (this is a naive detector bacause the Refusal Loss can be regarded the probability that the LLM won't reject the user query).
+ the gradient norm of <strong>Refusal Loss</strong> to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value is under 0.5 (this is a naive detector bacause the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
  Below we present the definition of the <strong>Refusal Loss</strong> and the approximation of it's function value and gradient.
- See more details about the concept, approximation, gradient estimation and landscape drawing of it in our paper.
+ See more details about the concept, approximation, gradient estimation and landscape drawing techniques of it in our paper.
  </p>
 
  <div id="refusal-loss-formula" class="container">
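
For readers of this diff, the two-step detection that the changed paragraph describes (reject when the Refusal Loss value is under 0.5, then flag a large gradient norm) might be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the names `grad_norm_estimate` and `detect`, the random-direction finite-difference estimator, the numeric vector `x` standing in for the query, and the `grad_threshold` value of 1.0 are all hypothetical. The paper referenced in the paragraph defines the actual approximation and thresholds.

```python
import math
import random


def grad_norm_estimate(loss, x, num_dirs=64, mu=1e-3, seed=0):
    """Estimate the gradient norm of loss at x with random-direction
    finite differences (an assumed estimator, chosen for illustration)."""
    rng = random.Random(seed)
    base = loss(x)
    grad = [0.0] * len(x)
    for _ in range(num_dirs):
        # Sample a Gaussian direction and measure the slope along it.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (loss(shifted) - base) / mu
        # Average slope-weighted directions to approximate the gradient.
        grad = [g + slope * ui / num_dirs for g, ui in zip(grad, u)]
    return math.sqrt(sum(g * g for g in grad))


def detect(loss, x, value_threshold=0.5, grad_threshold=1.0):
    """Two-step check: value filter first, then gradient-norm filter.
    grad_threshold is a placeholder, not a value taken from the source."""
    if loss(x) < value_threshold:
        return "rejected: refusal loss below 0.5"
    if grad_norm_estimate(loss, x) > grad_threshold:
        return "rejected: large gradient norm"
    return "accepted"
```

In practice the `loss` callable would wrap an approximation of the Refusal Loss obtained from the LLM's responses; here it is left as any function from a list of floats to a value in [0, 1], so the control flow can be tried with toy functions.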