gregH committed on
Commit
0ca6ac2
1 Parent(s): c0b1c2b

Update index.html

Files changed (1)
  1. index.html +5 -3
index.html CHANGED
@@ -86,7 +86,7 @@ Exploring Refusal Loss Landscapes </title>
   the opposite, we compute the empirical Refusal Loss as the sample mean of the jailbroken results returned from the target LLM.
   <!--Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
   mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. -->
- We visualize the 2-D landscape of the empirical Refusal Loss as below:
+ We visualize the 2-D landscape of the empirical Refusal Loss on Vicuna 7B and Llama-2 7B as below:
   </p>
 
   <div class="container jailbreak-intro-sec">
@@ -94,8 +94,10 @@ Exploring Refusal Loss Landscapes </title>
   </div>
 
   <p>
- From this plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
- the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
+ We show the loss landscape for both Benign and Malicious queries in the above plot. The benign queries are non-harmful user instructions collected
+ from the LM-SYS Chatbot Arena leaderboard, which is a crowd-sourced open platform for LLM evaluation. The tested malicious queries are harmful
+ behavior user instructions with GCG jailbreak prompt. From this plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries,
+ which implies that the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
   the gradient norm of Refusal Loss to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value
   is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
   Below we present the definition of the Refusal Loss, the computation of its empirical values, and the approximation of its gradient, see more
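The estimation procedure described in the changed paragraphs can be sketched in a few lines: sample the target LLM repeatedly on the same query, label each response as jailbroken (1) or refused (0), and take the sample mean as the empirical Refusal Loss; the naive initial filter then rejects the query when that value is under 0.5. This is a minimal sketch, not the repository's implementation — `query_llm` and `jailbroken` are hypothetical stand-ins for sampling one response from the target LLM and for the response classifier.

```python
def refusal_loss_empirical(query, query_llm, jailbroken, n_samples=20):
    """Estimate the Refusal Loss for one query as the sample mean of
    jailbroken indicators (1 = jailbreak succeeded, 0 = LLM refused).

    `query_llm` samples one response from the target LLM for `query`;
    `jailbroken` maps a response to 0/1. Both are assumed helpers here.
    """
    results = [jailbroken(query_llm(query)) for _ in range(n_samples)]
    return sum(results) / n_samples


def naive_detector(query, query_llm, jailbroken, threshold=0.5):
    """Naive initial filter: reject the query when the empirical Refusal
    Loss (interpretable as the probability the LLM won't refuse) is under
    the threshold, i.e. when the LLM refuses in most sampled responses."""
    return refusal_loss_empirical(query, query_llm, jailbroken) < threshold
```

Since each sampled response only contributes a 0/1 indicator, the estimate's variance shrinks with the number of samples, which is why the page describes querying the target LLM multiple times with the same query.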