Update index.html
index.html CHANGED (+1 -1)
@@ -91,7 +91,7 @@ Exploring Refusal Loss Landscapes </title>
 </div>

 <p>
-From
+From this plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
 the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
 the gradient norm of Refusal Loss to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value
 is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
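The two-stage check described in the changed paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the query is represented as a plain list of floats, the three toy loss functions stand in for a real Refusal Loss (which would be estimated from LLM responses), the gradient norm is approximated by central finite differences, and `norm_threshold` is a hypothetical parameter whose value the paragraph does not specify.

```python
import math
from typing import Callable, List

LossFn = Callable[[List[float]], float]


def grad_norm(loss: LossFn, x: List[float], eps: float = 1e-4) -> float:
    """Estimate the L2 norm of the gradient of `loss` at `x` by central differences."""
    squared = 0.0
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        partial = (loss(up) - loss(down)) / (2.0 * eps)
        squared += partial * partial
    return math.sqrt(squared)


def detect(loss: LossFn, x: List[float], norm_threshold: float) -> str:
    """Two-stage check: loss value first, then gradient norm (threshold is assumed)."""
    # Stage 1: naive filter, reject when the Refusal Loss value is under 0.5.
    if loss(x) < 0.5:
        return "rejected: refusal loss below 0.5"
    # Stage 2: reject queries that pass stage 1 but sit on a steep loss landscape.
    if grad_norm(loss, x) > norm_threshold:
        return "rejected: large gradient norm"
    return "accepted"


# Toy stand-ins for a Refusal Loss (hypothetical, for illustration only).
def benign_loss(x: List[float]) -> float:
    return 0.9  # high value, flat landscape


def jailbreak_loss(x: List[float]) -> float:
    return 1.0 / (1.0 + math.exp(-(10.0 * x[0] + 1.0)))  # passes stage 1, but steep


def refused_loss(x: List[float]) -> float:
    return 0.2  # already caught by the naive filter


if __name__ == "__main__":
    query = [0.0, 0.0]
    for name, fn in [("benign", benign_loss),
                     ("jailbreak", jailbreak_loss),
                     ("refused", refused_loss)]:
        print(name, "->", detect(fn, query, norm_threshold=1.0))
```

The ordering mirrors the paragraph: the value check runs first, and the gradient norm is only consulted for queries that pass it.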