gregH committed on
Commit
0ca6ac2
1 Parent(s): c0b1c2b

Update index.html

Files changed (1)
  1. index.html +5 -3
index.html CHANGED
@@ -86,7 +86,7 @@ Exploring Refusal Loss Landscapes </title>
   the opposite, we compute the empirical Refusal Loss as the sample mean of the jailbroken results returned from the target LLM.
   <!--Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
   mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. -->
- We visualize the 2-D landscape of the empirical Refusal Loss as below:
+ We visualize the 2-D landscape of the empirical Refusal Loss on Vicuna 7B and Llama-2 7B as below:
   </p>
 
   <div class="container jailbreak-intro-sec">
@@ -94,8 +94,10 @@ Exploring Refusal Loss Landscapes </title>
   </div>
 
   <p>
- From this plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
- the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
+ We show the loss landscape for both Benign and Malicious queries in the above plot. The benign queries are non-harmful user instructions collected
+ from the LM-SYS Chatbot Arena leaderboard, which is a crowd-sourced open platform for LLM evaluation. The tested malicious queries are harmful
+ behavior user instructions with GCG jailbreak prompt. From this plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries,
+ which implies that the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
   the gradient norm of Refusal Loss to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value
   is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
   Below we present the definition of the Refusal Loss, the computation of its empirical values, and the approximation of its gradient, see more
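The estimation procedure described in the changed paragraphs can be sketched in a few lines: sample the target LLM repeatedly on the same query, label each response as jailbroken (1) or refused (0), and take the sample mean as the empirical Refusal Loss; the naive initial filter then rejects the query when that value is under 0.5. This is a minimal sketch, not the repository's implementation — `query_llm` and `jailbroken` are hypothetical stand-ins for sampling one response from the target LLM and for the response classifier.

```python
def refusal_loss_empirical(query, query_llm, jailbroken, n_samples=20):
    """Estimate the Refusal Loss for one query as the sample mean of
    jailbroken indicators (1 = jailbreak succeeded, 0 = LLM refused).

    `query_llm` samples one response from the target LLM for `query`;
    `jailbroken` maps a response to 0/1. Both are assumed helpers here.
    """
    results = [jailbroken(query_llm(query)) for _ in range(n_samples)]
    return sum(results) / n_samples


def naive_detector(query, query_llm, jailbroken, threshold=0.5):
    """Naive initial filter: reject the query when the empirical Refusal
    Loss (interpretable as the probability the LLM won't refuse) is under
    the threshold, i.e. when the LLM refuses in most sampled responses."""
    return refusal_loss_empirical(query, query_llm, jailbroken) < threshold
```

Since each sampled response only contributes a 0/1 indicator, the estimate's variance shrinks with the number of samples, which is why the page describes querying the target LLM multiple times with the same query.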