gregH committed on
Commit 546578d · verified
1 Parent(s): 50167de

Update index.html

Files changed (1)
  1. index.html +1 -1
index.html CHANGED
@@ -81,7 +81,7 @@ Exploring Refusal Loss Landscapes </title>
  <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
  autoregressive sampling-based generation. With this randomness, it is an
  interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
- sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called Refusal Loss to represent the probability with which
+ sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to represent the probability with which
  the LLM won't reject the input user query. Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
  mean of the Jailbroken results (1 denotes a successful jailbreak and 0 otherwise) to approximate the function value. Using the approximation, we visualize the 2-d landscape of the Refusal Loss below:
  </p>
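
Note on the paragraph in this hunk: the sample-mean approximation it describes can be sketched roughly as below. This is a minimal illustration, not the authors' implementation. The names `estimate_refusal_loss`, `sample_response`, and `is_jailbroken` are hypothetical; the real page does not say how the target LLM is queried or how a response is judged as jailbroken, so both are passed in as caller-supplied functions.

```python
from typing import Callable


def estimate_refusal_loss(
    query: str,
    sample_response: Callable[[str], str],
    is_jailbroken: Callable[[str], bool],
    n_samples: int = 20,
) -> float:
    """Approximate the Refusal Loss of `query` by a sample mean.

    Assumptions (not from the source page):
      - sample_response(query) draws one stochastic response from the target LLM.
      - is_jailbroken(response) returns True when the response is NOT a refusal.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")

    # Each draw contributes 1 for a successful jailbreak and 0 otherwise,
    # matching the indicator described in the paragraph.
    outcomes = [
        1 if is_jailbroken(sample_response(query)) else 0
        for _ in range(n_samples)
    ]
    # The sample mean of the indicators is the approximation of the loss value.
    return sum(outcomes) / n_samples
```

With a stub model and a simple keyword-based judge (both assumptions), a call might look like `estimate_refusal_loss(q, my_model, lambda r: not r.startswith("I'm sorry"), n_samples=50)`. A larger `n_samples` reduces the variance of the estimate at the cost of more queries to the model.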