Training language models to follow instructions with human feedback
arXiv: 2203.02155
Note: RLHF (the InstructGPT paper); the pairwise reward-model loss is sketched below.
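A minimal sketch of the pairwise reward-model objective that InstructGPT-style RLHF trains before the RL stage; the function name and the dummy reward tensors are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss for the reward model:
    -log sigmoid(r(x, y_chosen) - r(x, y_rejected)), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy scalar rewards for a batch of 3 comparison pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.1, 0.5, 1.1])
print(reward_model_loss(chosen, rejected))
```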
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
arXiv: 2305.18290
Note: DPO, Stanford, Chelsea Finn's team; the DPO loss is sketched below.
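A minimal sketch of the DPO objective, assuming per-sequence log-probabilities have already been summed over tokens; the variable names, beta value, and dummy numbers are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: -log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected)),
    where each log-ratio is log pi_theta(y|x) - log pi_ref(y|x)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Dummy sequence log-probabilities for a batch of 2 preference pairs.
pi_w = torch.tensor([-12.0, -9.5])
pi_l = torch.tensor([-13.0, -9.0])
ref_w = torch.tensor([-12.5, -9.8])
ref_l = torch.tensor([-12.8, -9.2])
print(dpo_loss(pi_w, pi_l, ref_w, ref_l))
```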
Statistical Rejection Sampling Improves Preference Optimization
arXiv: 2309.06657
Note: Statistical rejection sampling (RSO); a simplified sketch follows below.
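A simplified sketch of reward-guided rejection sampling in the spirit of RSO: candidates drawn from an SFT policy are accepted with a probability that grows with their reward. The acceptance rule exp((r - r_max) / beta), the hyperparameters, and the toy inputs are assumptions for illustration, not the paper's exact algorithm.

```python
import math
import random

def rejection_sample(candidates, rewards, beta=0.5, num_accept=4):
    """Accept candidate y with probability exp((r(y) - max_r) / beta), which
    approximates sampling from a policy reweighted by exp(r / beta)."""
    max_r = max(rewards)
    accepted = []
    while len(accepted) < num_accept:
        i = random.randrange(len(candidates))
        if random.random() < math.exp((rewards[i] - max_r) / beta):
            accepted.append(candidates[i])
    return accepted

# Toy candidates with reward-model scores.
print(rejection_sample(["a", "b", "c", "d"], [0.2, 1.5, 0.9, -0.3]))
```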
SimPO: Simple Preference Optimization with a Reference-Free Reward
arXiv: 2405.14734
Note: SimPO, Princeton, Prof. Danqi Chen's team; the reference-free loss is sketched below.
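A minimal sketch of the SimPO loss, which scores each response by its length-normalized log-likelihood (no reference model) and imposes a target reward margin gamma; the tensor values and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_len, rejected_len,
               beta=2.0, gamma=1.0):
    """SimPO: implicit reward = beta * (summed log-prob / sequence length),
    trained with a margin gamma between chosen and rejected responses."""
    chosen_reward = beta * chosen_logps / chosen_len
    rejected_reward = beta * rejected_logps / rejected_len
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()

# Dummy summed log-probs and sequence lengths for a batch of 2 pairs.
print(simpo_loss(torch.tensor([-40.0, -25.0]), torch.tensor([-55.0, -30.0]),
                 torch.tensor([20.0, 12.0]), torch.tensor([22.0, 13.0])))
```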
Weak-to-Strong Extrapolation Expedites Alignment
arXiv: 2404.16792
Note: ExPO. The authors claim ExPO needs no additional model training or tuning and report an improvement of roughly 0.4 points on MT-Bench; the weight extrapolation is sketched below.
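A minimal sketch of ExPO-style weight extrapolation, which is why no further training is needed: the new model is just a linear combination of the SFT and aligned checkpoints, extrapolated past the aligned one. The state-dict handling and the alpha value are illustrative.

```python
import torch

def expo_extrapolate(sft_state, aligned_state, alpha=0.3):
    """Weak-to-strong extrapolation:
    theta_expo = (1 + alpha) * theta_aligned - alpha * theta_sft,
    i.e. move beyond the aligned weights along the SFT-to-aligned direction.
    No gradient updates, only a weighted combination of checkpoints."""
    return {k: (1 + alpha) * aligned_state[k] - alpha * sft_state[k]
            for k in aligned_state}

# Toy example with a single-tensor "model".
sft = {"w": torch.tensor([1.0, 2.0])}
aligned = {"w": torch.tensor([1.5, 2.5])}
print(expo_extrapolate(sft, aligned))
```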
Self-Instruct: Aligning Language Models with Self-Generated Instructions
arXiv: 2212.10560
Note: Self-Instruct; the generate-and-filter loop is sketched below.
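A minimal sketch of one Self-Instruct generate-and-filter iteration. The `generate_instructions` callable, the difflib-based similarity (a stand-in for the paper's ROUGE-L filter), and the 0.7 threshold are assumptions for illustration.

```python
import difflib

def similar(a: str, b: str) -> float:
    # Stand-in similarity; the paper filters near-duplicates with ROUGE-L overlap.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def self_instruct_step(task_pool, generate_instructions, threshold=0.7):
    """One iteration: prompt a model with examples from the pool, then keep only
    generated instructions that are not too similar to anything already in it.
    `generate_instructions` is a hypothetical callable wrapping an LLM."""
    candidates = generate_instructions(task_pool[:8])  # few-shot prompt from the pool
    for cand in candidates:
        if all(similar(cand, t) < threshold for t in task_pool):
            task_pool.append(cand)
    return task_pool

# Toy usage with a stub generator in place of a real LLM call.
pool = ["Summarize the article.", "Translate the sentence to French."]
stub = lambda examples: ["Write a haiku about the sea.", "Summarize the article."]
print(self_instruct_step(pool, stub))
```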
WizardLM: Empowering Large Language Models to Follow Complex Instructions
arXiv: 2304.12244
Note: Evol-Instruct; the instruction-evolution loop is sketched below.
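A minimal sketch of an Evol-Instruct-style loop mixing in-depth rewrites (make the instruction harder) and in-breadth rewrites (create a new instruction in the same domain); the prompt wording, the `llm` callable, and the stub in the usage line are illustrative assumptions, not the paper's exact prompts.

```python
import random

IN_DEPTH_OPS = ["add constraints", "deepen", "concretize", "increase reasoning steps"]

def evolve_instruction(instruction, llm, rounds=3):
    """Repeatedly ask an LLM to rewrite an instruction, alternating between
    in-depth and in-breadth evolution. `llm(prompt)` is a hypothetical callable."""
    evolved = instruction
    for _ in range(rounds):
        if random.random() < 0.5:
            op = random.choice(IN_DEPTH_OPS)
            prompt = f"Rewrite the instruction to {op}:\n{evolved}"
        else:
            prompt = f"Create a brand-new instruction in the same domain as:\n{evolved}"
        candidate = llm(prompt)
        if candidate and candidate != evolved:  # crude elimination of failed evolutions
            evolved = candidate
    return evolved

# Toy usage with a stub LLM that simply appends a requirement.
stub = lambda prompt: prompt.splitlines()[-1] + " Explain your reasoning."
print(evolve_instruction("Sort a list of numbers.", llm=stub))
```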