Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


pair-preference-model-LLaMA3-8B - GGUF
- Model creator: https://huggingface.co/RLHFlow/
- Original model: https://huggingface.co/RLHFlow/pair-preference-model-LLaMA3-8B/


| Name | Quant method | Size |
| ---- | ---- | ---- |
| [pair-preference-model-LLaMA3-8B.Q2_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q2_K.gguf) | Q2_K | 2.96GB |
| [pair-preference-model-LLaMA3-8B.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ3_XS.gguf) | IQ3_XS | 3.28GB |
| [pair-preference-model-LLaMA3-8B.IQ3_S.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ3_S.gguf) | IQ3_S | 3.43GB |
| [pair-preference-model-LLaMA3-8B.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q3_K_S.gguf) | Q3_K_S | 3.41GB |
| [pair-preference-model-LLaMA3-8B.IQ3_M.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ3_M.gguf) | IQ3_M | 3.52GB |
| [pair-preference-model-LLaMA3-8B.Q3_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q3_K.gguf) | Q3_K | 3.74GB |
| [pair-preference-model-LLaMA3-8B.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q3_K_M.gguf) | Q3_K_M | 3.74GB |
| [pair-preference-model-LLaMA3-8B.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q3_K_L.gguf) | Q3_K_L | 4.03GB |
| [pair-preference-model-LLaMA3-8B.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ4_XS.gguf) | IQ4_XS | 4.18GB |
| [pair-preference-model-LLaMA3-8B.Q4_0.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_0.gguf) | Q4_0 | 4.34GB |
| [pair-preference-model-LLaMA3-8B.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.IQ4_NL.gguf) | IQ4_NL | 4.38GB |
| [pair-preference-model-LLaMA3-8B.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_K_S.gguf) | Q4_K_S | 4.37GB |
| [pair-preference-model-LLaMA3-8B.Q4_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_K.gguf) | Q4_K | 4.58GB |
| [pair-preference-model-LLaMA3-8B.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_K_M.gguf) | Q4_K_M | 4.58GB |
| [pair-preference-model-LLaMA3-8B.Q4_1.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q4_1.gguf) | Q4_1 | 4.78GB |
| [pair-preference-model-LLaMA3-8B.Q5_0.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_0.gguf) | Q5_0 | 5.21GB |
| [pair-preference-model-LLaMA3-8B.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_K_S.gguf) | Q5_K_S | 5.21GB |
| [pair-preference-model-LLaMA3-8B.Q5_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_K.gguf) | Q5_K | 5.34GB |
| [pair-preference-model-LLaMA3-8B.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_K_M.gguf) | Q5_K_M | 5.34GB |
| [pair-preference-model-LLaMA3-8B.Q5_1.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q5_1.gguf) | Q5_1 | 5.65GB |
| [pair-preference-model-LLaMA3-8B.Q6_K.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q6_K.gguf) | Q6_K | 6.14GB |
| [pair-preference-model-LLaMA3-8B.Q8_0.gguf](https://huggingface.co/RichardErkhov/RLHFlow_-_pair-preference-model-LLaMA3-8B-gguf/blob/main/pair-preference-model-LLaMA3-8B.Q8_0.gguf) | Q8_0 | 7.95GB |




Original model description:
---
license: llama3
---
This preference model is trained from [LLaMA3-8B-it](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) with the training script at [Reward Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/pm_dev/pair-pm).

The training dataset is RLHFlow/pair_preference_model_dataset. The model achieves Chat 98.6, Chat-Hard 65.8, Safety 89.6, and Reasoning 94.9 on RewardBench.

See our paper [RLHF Workflow: From Reward Modeling to Online RLHF](https://arxiv.org/abs/2405.07863) for more details of this model.

## Serving the RM

Here is an example of using the preference model to rank a pair of responses. For n > 2 responses, it is recommended to use a tournament-style ranking strategy to find the best response, so that the number of comparisons stays linear in n.

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "RLHFlow/pair-preference-model-LLaMA3-8B"

model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# A second tokenizer with a plain chat template that simply concatenates
# the turns; it is used to flatten the conversation into the [CONTEXT] slot.
tokenizer_plain = AutoTokenizer.from_pretrained(model_path, use_fast=True)
tokenizer_plain.chat_template = "\n{% for message in messages %}{% if loop.index0 % 2 == 0 %}\n\n<turn> user\n {{ message['content'] }}{% else %}\n\n<turn> assistant\n {{ message['content'] }}{% endif %}{% endfor %}\n\n\n"

prompt_template = "[CONTEXT] {context} [RESPONSE A] {response_A} [RESPONSE B] {response_B} \n"
token_id_A = tokenizer.encode("A", add_special_tokens=False)
token_id_B = tokenizer.encode("B", add_special_tokens=False)
assert len(token_id_A) == 1 and len(token_id_B) == 1
token_id_A = token_id_A[0]
token_id_B = token_id_B[0]
temperature = 1.0


model.eval()
response_chosen = "BBBB"
response_rejected = "CCCC"

# Multi-turn conversations are also supported: the context may contain
# any number of alternating user/assistant turns.
instruction = [
    {"role": "user", "content": ...},
    {"role": "assistant", "content": ...},
    {"role": "user", "content": ...},
]
context = tokenizer_plain.apply_chat_template(instruction, tokenize=False)
responses = [response_chosen, response_rejected]
probs_chosen = []
for chosen_position in [0, 1]:
    # Evaluate both orderings (chosen as A, then as B) and average,
    # to mitigate position bias.
    response_A = responses[chosen_position]
    response_B = responses[1 - chosen_position]
    prompt = prompt_template.format(context=context, response_A=response_A, response_B=response_B)
    message = [
        {"role": "user", "content": prompt},
    ]

    # Strip the BOS token inserted by the chat template, since the encoder
    # must not add special tokens a second time.
    input_ids = tokenizer.encode(
        tokenizer.apply_chat_template(message, tokenize=False).replace(tokenizer.bos_token, ""),
        return_tensors="pt",
        add_special_tokens=False,
    ).cuda()

    with torch.no_grad():
        output = model(input_ids)
    logit_A = output.logits[0, -1, token_id_A].item()
    logit_B = output.logits[0, -1, token_id_B].item()
    # Softmax over the two label logits gives P(preferred | ordering).
    Z = np.exp(logit_A / temperature) + np.exp(logit_B / temperature)
    logit_chosen = [logit_A, logit_B][chosen_position]
    prob_chosen = np.exp(logit_chosen / temperature) / Z
    probs_chosen.append(prob_chosen)

avg_prob_chosen = np.mean(probs_chosen)
correct = 0.5 if avg_prob_chosen == 0.5 else float(avg_prob_chosen > 0.5)
print(correct)
```
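For n > 2 candidates, the tournament strategy mentioned above can be sketched as follows. This is an illustrative sketch, not part of the released code: `pairwise_prefer` is a hypothetical stand-in for the pairwise call shown above (e.g. returning `avg_prob_chosen > 0.5` for the first argument), stubbed here with a toy comparator.

```python
# Tournament-style ranking: the winner of each pairwise comparison
# advances, so n responses need only n - 1 preference-model calls
# (linear in n, versus quadratic for all-pairs comparison).

def tournament_best(responses, pairwise_prefer):
    """Return the response that survives a sequential knockout tournament.

    pairwise_prefer(a, b) should return True if a is preferred over b.
    """
    best = responses[0]
    for challenger in responses[1:]:
        if pairwise_prefer(challenger, best):
            best = challenger
    return best

# Toy example: prefer the longer response.
candidates = ["ok", "a longer answer", "mid"]
print(tournament_best(candidates, lambda a, b: len(a) > len(b)))
# prints "a longer answer"
```

In practice, `pairwise_prefer` would build the `[CONTEXT] ... [RESPONSE A] ... [RESPONSE B] ...` prompt for each pair and compare the averaged probabilities exactly as in the example above.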

## Citation
If you use this model in your research, please consider citing our paper:
```
@misc{rlhflow,
      title={RLHF Workflow: From Reward Modeling to Online RLHF}, 
      author={Hanze Dong and Wei Xiong and Bo Pang and Haoxiang Wang and Han Zhao and Yingbo Zhou and Nan Jiang and Doyen Sahoo and Caiming Xiong and Tong Zhang},
      year={2024},
      eprint={2405.07863},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
and Google's SLiC-HF paper (which first proposed this pairwise preference model):
```
@article{zhao2023slic,
  title={SLiC-HF: Sequence Likelihood Calibration with Human Feedback},
  author={Zhao, Yao and Joshi, Rishabh and Liu, Tianqi and Khalman, Misha and Saleh, Mohammad and Liu, Peter J},
  journal={arXiv preprint arXiv:2305.10425},
  year={2023}
}
```