Update index.html
index.html (+3 -3)
@@ -61,9 +61,9 @@ Exploring Refusal Loss Landscapes </title>
 Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
 jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
 we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
-detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the
-
-methods and show the defense performance.
+detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the 2-d Refusal Loss
+Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
+methods and show the defense performance against several Jailbreak attack methods.
 </p>

 <h2 id="what-is-jailbreak">What is Jailbreak?</h2>