gregH committed on
Commit 035ee3e · verified · 1 Parent(s): 8992c9a

Update index.html

Files changed (1)
  1. index.html +1 -1
index.html CHANGED
@@ -91,7 +91,7 @@ Exploring Refusal Loss Landscapes </title>
  </div>
 
  <p>
- From the above plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
+ From this plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
  the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
  the gradient norm of Refusal Loss to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value
  is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
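
The paragraph in this hunk describes a two-stage check: reject when the Refusal Loss value is under 0.5, then flag the remaining queries whose Refusal Loss has a large gradient norm. The following is a minimal Python sketch of that idea, not the authors' implementation. The names `estimate_grad_norm` and `two_stage_detect`, the random-direction finite-difference estimator, the gradient threshold of 1.0, and the toy loss functions are all assumptions made for illustration; the page itself does not say how the gradient norm is computed or which threshold is used.

```python
import math
import random
from typing import Callable, List, Sequence

Loss = Callable[[Sequence[float]], float]


def estimate_grad_norm(loss: Loss, x: Sequence[float],
                       num_directions: int = 20,
                       step: float = 1e-2,
                       seed: int = 0) -> float:
    """Estimate ||grad loss(x)|| using only function evaluations.

    Assumption: a random-direction finite-difference estimator stands in
    for whatever estimator the original work uses.
    """
    rng = random.Random(seed)
    base = loss(x)
    grad: List[float] = [0.0] * len(x)
    for _ in range(num_directions):
        # Gaussian direction u; averaging slope * u approximates the gradient.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + step * ui for xi, ui in zip(x, u)]
        slope = (loss(shifted) - base) / step
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_directions
    return math.sqrt(sum(g * g for g in grad))


def two_stage_detect(loss: Loss, x: Sequence[float],
                     value_threshold: float = 0.5,
                     grad_threshold: float = 1.0) -> str:
    """Return a verdict for input x (grad_threshold is a hypothetical value)."""
    # Stage 1: naive filter on the function value, as described in the text.
    if loss(x) < value_threshold:
        return "rejected: refusal loss under 0.5"
    # Stage 2: a steep landscape around x is treated as a jailbreak signal.
    if estimate_grad_norm(loss, x) > grad_threshold:
        return "rejected: large gradient norm"
    return "accepted"


# Toy stand-ins for the Refusal Loss around three kinds of queries.
def flat_loss(x: Sequence[float]) -> float:
    return 0.9                      # high value, flat landscape

def steep_loss(x: Sequence[float]) -> float:
    return 0.7 + 5.0 * x[0]         # passes stage 1 at the origin, but steep

def low_loss(x: Sequence[float]) -> float:
    return 0.2                      # the model mostly refuses already


if __name__ == "__main__":
    origin = [0.0, 0.0]
    for name, fn in [("flat", flat_loss), ("steep", steep_loss), ("low", low_loss)]:
        print(name, "->", two_stage_detect(fn, origin))
```

In a real setting the loss would be evaluated on the LLM's input representation rather than on a two-dimensional toy vector, and both thresholds would need to be calibrated on benign queries.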