|
<!DOCTYPE html> |
|
<html lang="en-US"> |
|
<head> |
|
<meta charset="UTF-8"> |
|
|
|
|
|
<title>Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes</title>
|
<meta property="og:title" content="Gradient Cuff" /> |
|
<meta property="og:locale" content="en_US" /> |
|
<meta name="description" content="Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" /> |
|
<meta property="og:description" content="Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" /> |
|
<script type="application/ld+json"> |
|
{"@context":"https://schema.org","@type":"WebSite","description":"Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes","headline":"Gradient Cuff","name":"Gradient Cuff","url":"https://huggingface.co/spaces/gregH/Gradient Cuff"}</script> |
|
|
|
|
|
<link rel="preconnect" href="https://fonts.gstatic.com"> |
|
<link rel="preload" href="https://fonts.googleapis.com/css?family=Open+Sans:400,700&display=swap" as="style" type="text/css" crossorigin> |
|
<meta name="viewport" content="width=device-width, initial-scale=1"> |
|
<meta name="theme-color" content="#157878"> |
|
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent"> |
|
|
|
<link rel="stylesheet" href="assets/css/bootstrap/bootstrap.min.css?v=90447f115a006bc45b738d9592069468b20e2551"> |
|
<link rel="stylesheet" href="assets/css/style.css?v=90447f115a006bc45b738d9592069468b20e2551"> |
|
|
|
<link rel="stylesheet" href="assets/css/custom_style.css?v=90447f115a006bc45b738d9592069468b20e2551"> |
|
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script> |
|
<link rel="stylesheet" href="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/themes/smoothness/jquery-ui.css"> |
|
<script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js"></script> |
|
<script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.9.4/Chart.js"></script> |
|
<script src="assets/js/calibration.js?v=90447f115a006bc45b738d9592069468b20e2551"></script> |
|
|
|
|
|
|
|
|
|
|
|
|
|
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script> |
|
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script> |
|
|
|
|
|
|
|
|
|
</head> |
|
<body> |
|
<a id="skip-to-content" href="#content">Skip to the content.</a> |
|
|
|
<header class="page-header" role="banner"> |
|
<h1 class="project-name">Gradient Cuff</h1> |
|
<h2 class="project-tagline">Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes</h2> |
|
|
|
|
|
</header> |
|
|
|
<main id="content" class="main-content" role="main"> |
|
<h2 id="introduction">Introduction</h2> |
|
|
|
<p>Large Language Models (LLMs) are becoming a prominent generative AI tool: the user enters a
query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align
these LLMs to human values using advanced training techniques such as Reinforcement Learning from
Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
jailbreak attempts aimed at subverting the embedded safety guardrails. To address this challenge,
we define and investigate the <strong>Refusal Loss</strong> of LLMs and propose a method called <strong>Gradient Cuff</strong> to
detect jailbreak attempts. In this demonstration, we first introduce the concept of "jailbreak". Then we present the refusal loss
landscape and derive Gradient Cuff from the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
methods and show its defense performance.
</p>
|
|
|
<h2 id="what-is-jailbreak">What is Jailbreak?</h2> |
|
<p>Jailbreak attacks maliciously insert or replace tokens in the user instruction, or rewrite it entirely, to circumvent
the safety guardrails of aligned LLMs. A notable example is a jailbroken LLM being tricked into
generating hate speech targeting certain groups of people, as demonstrated below.</p>
|
|
|
<div class="container"> |
|
<div id="jailbreak-intro" class="row align-items-center jailbreak-intro-sec"> |
|
<img id="jailbreak-intro-img" src="./jailbreak.png" /> |
|
</div> |
|
</div> |
|
|
|
<h3 id="refusal-loss">Refusal Loss</h3> |
|
<p>Current transformer-based LLMs return different responses to the same query due to the randomness of
autoregressive, sampling-based generation. An interesting consequence of this randomness is that a malicious
user query will sometimes be rejected by the target LLM, but will sometimes bypass the safety guardrail.
Based on this observation, we propose a new concept called Refusal Loss and visualize its 2-D
landscape below:
</p>
|
|
|
<div class="container jailbreak-intro-sec"> |
|
<div><img id="jailbreak-intro-img" src="./loss_landscape.png" /></div> |
|
</div> |
|
|
|
<div id="refusal-loss-formula" class="container"> |
|
<div id="refusal-loss-formula-list" class="row align-items-center formula-list"> |
|
<a href="#ECE-formula" class="selected">Refusal Loss</a> |
|
<a href="#SCE-formula">Refusal Loss Approximation</a> |
|
<a href="#ACE-formula">Gradient Estimation</a> |
|
<div style="clear: both"></div> |
|
</div> |
|
<div id="refusal-loss-formula-content" class="row align-items-center"> |
|
<span id="ECE-formula" class="formula" style=""> |
|
$$ |
|
\displaystyle |
|
\begin{aligned} |
|
\phi_\theta(x)&=1-\mathbb{E}_{y \sim T_\theta(x)} JB(y)\\ |
|
JB (y) &= \begin{cases} |
|
1 \text{, if $y$ contains any jailbreak keyword;} \\ |
|
0 \text{, otherwise.} |
|
\end{cases} |
|
\end{aligned} |
|
$$ |
|
</span> |
|
<span id="SCE-formula" class="formula" style="display: none;"> |
|
$$ |
|
\displaystyle |
|
\begin{aligned} |
|
f_\theta(x)&=1-\frac{1}{N}\sum_{i=1}^N JB(y_i), \quad y_i \sim T_\theta(x)\\
|
JB (y_i) &= \begin{cases} |
|
1 \text{, if $y_i$ contains any jailbreak keyword;} \\ |
|
0 \text{, otherwise.} |
|
\end{cases} |
|
\end{aligned} |
|
$$ |
|
</span> |
|
<span id="ACE-formula" class="formula" style="display: none;">$$\displaystyle g_\theta(x)=\sum_{i=1}^P \frac{f_\theta(x\oplus \mu u_i)-f_\theta(x)}{\mu} u_i $$</span> |
|
</div> |
|
</div> |
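<p>To make the formulas above concrete, the following is a minimal sketch of how the sampling-based approximation \(f_\theta(x)\) and the zeroth-order gradient estimate \(g_\theta(x)\) could be computed. It is an illustration rather than our exact implementation: the keyword list, the <code>generate_from_embedding</code> interface, and the Gaussian perturbation directions are assumptions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Hypothetical keyword list; in practice this would be a curated set of phrases.
JAILBREAK_KEYWORDS = ["I cannot", "I am sorry", "As an AI"]

def JB(y):
    """JB(y) = 1 if response y contains any jailbreak keyword, else 0."""
    return 1 if any(kw in y for kw in JAILBREAK_KEYWORDS) else 0

def refusal_loss(generate_from_embedding, e, N=8):
    """f_theta(x) = 1 - (1/N) * sum_i JB(y_i), where each y_i is sampled from
    the LLM (temperature sampling enabled) given the query embedding e."""
    responses = [generate_from_embedding(e) for _ in range(N)]
    return 1.0 - sum(JB(y) for y in responses) / N

def estimate_gradient(generate_from_embedding, e, P=8, mu=0.02):
    """g_theta(x): finite-difference estimate of the refusal-loss gradient
    along P random directions u_i, mirroring the formula above."""
    f0 = refusal_loss(generate_from_embedding, e)
    g = np.zeros_like(e)
    for _ in range(P):
        u = np.random.randn(*e.shape)  # Gaussian direction (assumption)
        f_mu = refusal_loss(generate_from_embedding, e + mu * u)
        g += (f_mu - f0) / mu * u
    return g
</code></pre></div></div>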
|
|
|
<h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2> |
|
|
|
<div class="container"><img id="gradient-cuff-header" src="images/header.png" /></div> |
|
|
|
<h2 id="demonstration">Demonstration</h2> |
|
<p>In this demonstration, we compare Gradient Cuff against four baseline defenses (Perplexity Filter, SmoothLLM,
Erase-Check, and Self-Reminder) on two aligned LLMs, LLaMA-2-7B-Chat and Vicuna-7B-V1.5. Select a model and a
defense method to see the average refusal rate on malicious (jailbreak) queries and the refusal rate on benign
queries. For threshold-based defenses, the sliders adjust the detection thresholds, so you can explore the
trade-off between blocking jailbreaks and over-refusing benign queries.</p>

<p>We hope this tool can also facilitate the development of stronger jailbreak defenses.</p>
|
|
|
<div id="jailbreak-demo" class="container"> |
|
<div class="row align-items-center"> |
|
<div class="row" style="margin: 10px 0 0"> |
|
<div class="models-list"> |
|
<span style="margin-right: 1em;">Models</span> |
|
<span class="radio-group"><input type="radio" id="LLaMA2" class="options" name="models" value="llama2_7b_chat" checked="" /><label for="LLaMA2" class="option-label">LLaMA-2-7B-Chat</label></span> |
|
<span class="radio-group"><input type="radio" id="Vicuna" class="options" name="models" value="vicuna_7b_v1.5" /><label for="Vicuna" class="option-label">Vicuna-7B-V1.5</label></span> |
|
</div> |
|
</div> |
|
</div> |
|
<div class="row align-items-center"> |
|
<div class="col-4"> |
|
<div id="defense-methods"> |
|
<div class="row align-items-center"><input type="radio" id="defense_ppl" class="options" name="defense" value="ppl" /><label for="defense_ppl" class="defense">Perplexity Filter</label></div> |
|
<div class="row align-items-center"><input type="radio" id="defense_smoothllm" class="options" name="defense" value="smoothllm" /><label for="defense_smoothllm" class="defense">SmoothLLM</label></div> |
|
<div class="row align-items-center"><input type="radio" id="defense_erase_check" class="options" name="defense" value="erase_check" /><label for="defense_erase_check" class="defense">Erase-Check</label></div> |
|
<div class="row align-items-center"><input type="radio" id="defense_self_reminder" class="options" name="defense" value="self_reminder" /><label for="defense_self_reminder" class="defense">Self-Reminder</label></div> |
|
<div class="row align-items-center"><input type="radio" id="defense_gradient_cuff" class="options" name="defense" value="gradient_cuff" /><label for="defense_gradient_cuff" class="defense"><span style="font-weight: bold;">Gradient Cuff</span></label></div> |
|
</div> |
|
<div class="row align-items-center"> |
|
<div class="attack-success-rate"><span class="jailbreak-metric">Average Malicious Refusal Rate</span><span class="attack-success-rate-value" id="asr-value">0.95875</span></div> |
|
</div> |
|
<div class="row align-items-center"> |
|
<div class="benign-refusal-rate"><span class="jailbreak-metric">Benign Refusal Rate</span><span class="benign-refusal-rate-value" id="brr-value">0.05000</span></div> |
|
</div> |
|
</div> |
|
<div class="col-8"> |
|
<figure class="figure"> |
|
<img id="reliability-diagram" src="demo_results/gradient_cuff_llama2_7b_chat_threshold_100.png" alt="CIFAR-100 Calibrated Reliability Diagram (Full)" /> |
|
<div class="slider-container"> |
|
<div class="slider-label"><span>Perplexity Threshold</span></div> |
|
<div class="slider-content" id="ppl-slider"><div id="ppl-threshold" class="ui-slider-handle"></div></div> |
|
</div> |
|
<div class="slider-container"> |
|
<div class="slider-label"><span>Gradient Threshold</span></div> |
|
<div class="slider-content" id="gradient-norm-slider"><div id="gradient-norm-threshold" class="slider-value ui-slider-handle"></div></div> |
|
</div> |
|
<figcaption class="figure-caption"> |
|
</figcaption> |
|
</figure> |
|
</div> |
|
</div> |
|
</div> |
|
|
|
<h2 id="citations">Citations</h2> |
|
<p>If you find Gradient Cuff helpful and useful for your research, please cite our main paper as follows:</p>
|
|
|
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{hsiung2023nctv, |
|
title={{NCTV: Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes}}, |
|
author={Lei Hsiung, Yung-Chen Tang and Pin-Yu Chen and Tsung-Yi Ho}, |
|
booktitle={Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence}, |
|
publisher={Association for the Advancement of Artificial Intelligence}, |
|
year={2023}, |
|
month={February} |
|
} |
|
|
|
@misc{tang2022neural_clamping, |
|
title={{Neural Clamping: Joint Input Perturbation and Temperature Scaling for Neural Network Calibration}}, |
|
author={Yung-Chen Tang and Pin-Yu Chen and Tsung-Yi Ho}, |
|
year={2022}, |
|
eprint={2209.11604}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.LG} |
|
} |
|
</code></pre></div></div> |
|
|
|
|
|
<footer class="site-footer"> |
|
|
|
<span class="site-footer-owner">Gradient Cuff is maintained by <a href="https://gregxmhu.github.io/">Xiaomeng Hu</a></a>.</span> |
|
|
|
</footer> |
|
</main> |
|
</body> |
|
</html> |
|
|