Spaces: TrustSafeAI/GradientCuff-Jailbreak-Defense

gregH committed on Feb 28, 2024

Commit 546578d (verified) · 1 Parent(s): 50167de

Update index.html

Files changed (1): index.html

@@ -81,7 +81,7 @@ Exploring Refusal Loss Landscapes </title>
 <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
   autoregressive sampling-based generation. With this randomness, it is an
   interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
-  sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called Refusal Loss to represent the probability with which
+  sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to represent the probability with which
   the LLM won't reject the input user query. Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
   mean of the Jailbroken results (1 denotes a successful jailbreak and 0 otherwise) to approximate the function value. Using the approximation, we visualize the 2-d landscape of the Refusal Loss below:
 </p>
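
Note on the paragraph in this hunk: the sample-mean approximation it describes can be sketched roughly as follows. This is a minimal illustration, not the project's implementation. The names `estimate_refusal_loss`, `is_jailbroken`, `make_toy_llm`, and the `REFUSAL_MARKERS` keyword list are all hypothetical, and the keyword check is only an assumed stand-in for whatever jailbreak judge the authors actually use. The toy generator replaces a real LLM call so the sketch runs on its own.

```python
import random
from typing import Callable

# Hypothetical refusal phrases; the page does not say how a response
# is judged, so this keyword check is only an assumption.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't")


def is_jailbroken(response: str) -> int:
    """Return 1 if the response contains no refusal marker, else 0."""
    text = response.lower()
    return 0 if any(marker in text for marker in REFUSAL_MARKERS) else 1


def estimate_refusal_loss(
    generate: Callable[[str], str], query: str, n_samples: int = 10
) -> float:
    """Approximate the refusal loss as the sample mean of jailbreak
    indicators over n_samples independent generations for one query."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    outcomes = [is_jailbroken(generate(query)) for _ in range(n_samples)]
    return sum(outcomes) / n_samples


def make_toy_llm(p_refuse: float, seed: int = 0) -> Callable[[str], str]:
    """Toy stand-in for a sampling-based LLM: refuses with probability
    p_refuse, otherwise returns a compliant-looking answer."""
    rng = random.Random(seed)

    def generate(query: str) -> str:
        if rng.random() < p_refuse:
            return "I'm sorry, but I cannot help with that."
        return "Sure, here is a response to your request."

    return generate


if __name__ == "__main__":
    toy = make_toy_llm(p_refuse=0.7)
    print(estimate_refusal_loss(toy, "example query", n_samples=20))
```

With a real model, `generate` would wrap a sampling call with a nonzero temperature. A larger `n_samples` lowers the variance of the estimate but costs more queries.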