Spaces: TrustSafeAI/GradientCuff-Jailbreak-Defense

gregH committed on Feb 28, 2024

Commit f134f1b

1 Parent(s): 273a1f2

Update index.html

Files changed (1)

index.html CHANGED

@@ -155,10 +155,11 @@ We provide more details about the running flow of Gradient Cuff in the paper.
 </p>
 <h2 id="demonstration">Demonstration</h2>
-<p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder) against 6
-  different jailbreak attacks~(GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5).
-  We demonstrate the average refusal rate across these 6 malicious user query datasets as the Average Malicious Refusal Rate and the refusal rate
-  on benign user queries as the Benign Refusal Rate. The detection performance against individual jailbreak attacks is shown in the below bar chart.
 </p>

+<p>We evaluated Gradient Cuff and 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder)
+  against 6 different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and
+  Vicuna-7B-V1.5). Below, we report the average refusal rate across these 6 malicious user query datasets as the Average Malicious Refusal
+  Rate and the refusal rate on benign user queries as the Benign Refusal Rate. The defense performance against each jailbreak type is
+  shown in the provided bar chart.
 </p>
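
The two metrics named in the updated paragraph can be sketched in a few lines of Python. This is only an illustration of one plausible reading: it assumes the Average Malicious Refusal Rate is the unweighted mean of the six per-attack refusal rates (rather than a rate pooled over all malicious queries), and the function names and sample outcomes are hypothetical, not taken from the Space or the paper.

```python
from statistics import mean

# Attack names as listed in the paragraph above.
ATTACKS = ("GCG", "AutoDAN", "PAIR", "TAP", "Base64", "LRL")


def refusal_rate(refused):
    """Fraction of queries that were refused; `refused` is a list of booleans."""
    if not refused:
        raise ValueError("need at least one query")
    return sum(refused) / len(refused)


def average_malicious_refusal_rate(refusals_by_attack):
    """Unweighted mean of the per-attack refusal rates (an assumption, see lead-in)."""
    return mean(refusal_rate(refusals_by_attack[name]) for name in ATTACKS)


# Placeholder outcomes for illustration only; these are NOT results from the paper.
example = {name: [True, True, False, True] for name in ATTACKS}
example["Base64"] = [True, False, False, True]

amrr = average_malicious_refusal_rate(example)   # mean over the 6 attack datasets
brr = refusal_rate([False, False, True, False])  # Benign Refusal Rate on benign queries
```

A good defense would push the first number up while keeping the second low, which is why the page reports the two rates side by side.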