Update index.html
index.html CHANGED (+2 -2)
@@ -98,8 +98,8 @@ Exploring Refusal Loss Landscapes </title>
 the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
 the gradient norm of Refusal Loss to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value
 is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
-Below we present the definition of the Refusal Loss
-the landscape drawing techniques in our paper.
+Below we present the definition of the Refusal Loss, the computation of its empirical values, and the approximation of its gradient, see more
+details about them and the landscape drawing techniques in our paper.
 </p>
 
 <div id="refusal-loss-formula" class="container">
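The paragraph touched by this change describes a two-stage check: first reject a query whose Refusal Loss is under 0.5, then flag the remaining queries whose Refusal Loss has a large gradient norm. The Python sketch below illustrates that flow under stated assumptions and is not the authors' implementation. The function names, the representation of a query as a plain list of floats, the random-direction finite-difference gradient estimate, and the `grad_threshold` default are all hypothetical; only the 0.5 cutoff and the reading of Refusal Loss as the probability of not refusing come from the text.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def empirical_refusal_loss(refused: Sequence[bool]) -> float:
    """Fraction of sampled responses that were NOT refusals (hypothetical estimator)."""
    if not refused:
        raise ValueError("need at least one sampled response")
    return 1.0 - sum(refused) / len(refused)


def estimate_grad_norm(
    loss: Callable[[Vector], float],
    x: Vector,
    num_directions: int = 20,
    mu: float = 0.02,
    seed: int = 0,
) -> float:
    """Norm of a gradient estimate built from finite differences
    along random Gaussian directions (an assumed approximation scheme)."""
    rng = random.Random(seed)
    base = loss(x)
    grad = [0.0] * len(x)
    for _ in range(num_directions):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (loss(shifted) - base) / mu  # directional finite difference
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_directions
    return math.sqrt(sum(g * g for g in grad))


def is_rejected(
    loss: Callable[[Vector], float],
    x: Vector,
    loss_threshold: float = 0.5,   # cutoff stated in the text
    grad_threshold: float = 1.0,   # assumed value, would need tuning
) -> bool:
    """Stage 1: reject when the Refusal Loss is under the cutoff.
    Stage 2: otherwise reject when the estimated gradient norm is large."""
    if loss(x) < loss_threshold:
        return True
    return estimate_grad_norm(loss, x) > grad_threshold
```

In practice `loss` would wrap repeated sampling from the LLM, with `empirical_refusal_loss` applied to the refusal judgments of the sampled responses, and `grad_threshold` would be calibrated on benign queries rather than fixed in advance.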