gregH committed on
Commit
50167de
1 Parent(s): 8a3a312

Update index.html

  1. index.html +3 -3
index.html CHANGED
@@ -61,9 +61,9 @@ Exploring Refusal Loss Landscapes </title>
  Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
  jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
  we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
- detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the refusal loss
- landscape and propose the Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
- methods and show the defense performance.
+ detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the 2-d Refusal Loss
+ Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
+ methods and show the defense performance against several Jailbreak attack methods.
  </p>
 
  <h2 id="what-is-jailbreak">What is Jailbreak?</h2>