Would the team consider releasing the reward model in addition to the trained model? The reward model could be very useful for evaluating generation quality, and it would also make it easier for others to reproduce the RLHF training.