Update index.html
index.html (+6 −4)
@@ -92,9 +92,11 @@ Exploring Refusal Loss Landscapes </title>
 
 <p>
 From the above plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
-the
-the gradient norm of
-
+the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
+the gradient norm of the Refusal Loss to detect jailbreak attempts that pass the initial filtering, which rejects the input query when the function
+value is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
+Below we present the definition of the Refusal Loss and the approximation of its function value and gradient; see our paper for more details about
+them and the landscape drawing techniques.
 </p>
 
 <div id="refusal-loss-formula" class="container">
@@ -156,7 +158,7 @@ We provide more details about the running flow of Gradient Cuff in the paper.
 
 <h2 id="demonstration">Demonstration</h2>
 <p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder)
-against 6 different jailbreak attacks
+against 6 different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and
 Vicuna-7B-V1.5). Below, we report the average refusal rate across these 6 malicious user query datasets as the Average Malicious Refusal
 Rate and the refusal rate on benign user queries as the Benign Refusal Rate. The defending performance against different jailbreak types is
 shown in the provided bar chart.
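The paragraph added in the first hunk describes a two-stage rule: reject a query outright if the Refusal Loss value (the estimated probability the LLM won't refuse) is under 0.5, and otherwise reject it if the gradient norm of the Refusal Loss is large. A minimal sketch of that rule — the function names, thresholds, and the finite-difference gradient estimator below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def gradient_norm_estimate(f, x, mu=0.01, n_dirs=10, rng=None):
    """Zeroth-order (finite-difference) estimate of ||grad f(x)||,
    since a defender has no white-box gradient of the refusal behavior."""
    rng = rng or np.random.default_rng(0)
    fx = f(x)
    g = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.standard_normal(x.shape)      # random probe direction
        g += (f(x + mu * u) - fx) / mu * u    # directional slope times direction
    return np.linalg.norm(g / n_dirs)

def two_stage_check(x, refusal_loss, loss_threshold=0.5, grad_threshold=50.0):
    """Sketch of the two-stage detection described in the text.
    `refusal_loss` maps a query representation to an estimated probability
    that the LLM will NOT refuse it; thresholds here are hypothetical."""
    if refusal_loss(x) < loss_threshold:                    # stage 1: naive probability check
        return "reject"
    if gradient_norm_estimate(refusal_loss, x) > grad_threshold:  # stage 2: steep landscape
        return "reject"
    return "answer"
```

In a flat region of the landscape (benign queries, per the plot) the finite-difference slopes vanish and the query passes; a sub-0.5 loss value is rejected before the gradient is ever estimated.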