Description

mistral-7b-sft-beta model finetuned by hybrid WPO (GPT-4-turbo + on-policy sampling + Ultrafeedback). Details in WPO: Enhancing RLHF with Weighted Preference Optimization. The training data is wzhouad/zephyr-ultrafeedback-hybrid.

License

This model is licensed under the Zoom software license and is permitted for use only for noncommercial, educational, or academic research purposes.

Downloads last month: 8

Safetensors

Model size

7.24B params

Tensor type

F32

Inference Examples

Text Generation

This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train wzhouad/zephyr-7B-WPO-HB

Collection including wzhouad/zephyr-7B-WPO-HB

WPO

Collection

Models and datasets in paper "WPO: Enhancing RLHF with Weighted Preference Optimization". • 11 items • Updated Aug 22 • 4