Spaces: TrustSafeAI/GradientCuff-Jailbreak-Defense
gregH committed on Feb 28

Commit 09bf795 • 1 Parent(s): 747e566

Update index.html

Files changed (1)

index.html

@@ -132,8 +132,8 @@ Exploring Refusal Loss Landscapes </title>
 <h2 id="demonstration">Demonstration</h2>
 <p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder) against 6
   different jailbreak attacks~(GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5).
-  We report the average refusal rate across these 6 malicious user query datasets as True Positive Rate~(TPR) and the refusal rate
-  on benign user queries as False Positive Rate~(FPR).
+  We demonstrate the average refusal rate across these 6 malicious user query datasets and the refusal rate
+  on benign user queries as the Benign Refusal Rate.
 </p>
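
The two quantities named in the changed paragraph can be sketched as follows. This is a minimal illustration, not the Space's actual evaluation code: the function names, the boolean "refused" flags, and the choice to average per-dataset rates with equal weight are all assumptions made here for clarity.

```python
from statistics import mean


def refusal_rate(refused):
    """Fraction of queries that were refused; `refused` is a list of booleans."""
    if not refused:
        raise ValueError("need at least one query")
    return sum(refused) / len(refused)


def summarize(malicious_by_attack, benign):
    """Hypothetical helper: per-attack refusal rates, their unweighted mean
    across attack datasets, and the refusal rate on benign queries."""
    per_attack = {
        name: refusal_rate(flags) for name, flags in malicious_by_attack.items()
    }
    return {
        "per_attack": per_attack,
        # Assumption: each attack dataset counts equally in the average.
        "avg_malicious_refusal_rate": mean(per_attack.values()),
        "benign_refusal_rate": refusal_rate(benign),
    }


if __name__ == "__main__":
    # Toy flags for illustration only; these are not results from the paper.
    toy = summarize(
        {"GCG": [True, True], "PAIR": [True, False]},
        [False, False, True, False],
    )
    print(toy)
```

A higher average refusal rate on the malicious datasets and a lower Benign Refusal Rate both indicate a better defense, which is why the two numbers are shown side by side.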