🔥 Prometheus 2 was recently released by KAIST AI as an open alternative to proprietary LLM evaluators, closely mirroring both human and GPT-4 evaluation and surpassing the former Prometheus!
prometheus-eval/prometheus-7b-v2.0
prometheus-eval/prometheus-8x7b-v2.0
🌬️Fine-tuned on top of mistralai/Mistral-7B-Instruct-v0.2 and mistralai/Mixtral-8x7B-Instruct-v0.1
🗂️The datasets used for fine-tuning have been publicly released, i.e. prometheus-eval/Feedback-Collection and prometheus-eval/Preference-Collection
🤝🏻Unified LM evaluator for both absolute grading (a single prompt-completion pair) and relative grading (two completions for a given prompt), thanks to model merging
❌No longer requires a reference / golden answer, though one can still be provided optionally
🔝Surpasses the former version of Prometheus, and has a high correlation with human, GPT-4, and Claude 3 Opus scores when evaluating LMs
📝Apache 2.0 license
Long story short, an amazing job from KAIST AI bridging the gap between open LLM evaluators and bigger, proprietary models!
This week at Argilla, we decided to add a new task to use Prometheus 2 as an LLM evaluator using distilabel, so we implemented PrometheusEval.
😱 Using PrometheusEval running their 7B variant with vLLM on a single L40 on top of HuggingFaceH4/instruction-dataset, we got the 327 existing prompt-completion pairs evaluated and pushed to the Hub in less than 2 minutes!
Find the generated dataset and the code at distilabel-internal-testing/instruction-dataset-prometheus
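
To make the absolute vs. relative distinction above a bit more concrete, here is a minimal sketch of how one might post-process an evaluator's free-text feedback in each mode. It is not the distilabel or Prometheus 2 implementation: the helper names are hypothetical, and the `[RESULT]` marker (followed by a 1-5 score in absolute mode, or `A` / `B` in relative mode) is an assumption about the output format rather than something stated in this post.

```python
import re
from typing import Optional

# Assumption: the evaluator ends its feedback with "[RESULT] <verdict>".
# Adjust the marker to whatever your evaluator actually emits.
RESULT_MARKER = "[RESULT]"


def parse_absolute(feedback: str) -> Optional[int]:
    """Absolute grading: one prompt-completion pair, scored from 1 to 5.

    Returns the integer score found after the marker, or None if the
    marker is missing or the score is outside the 1-5 range.
    """
    match = re.search(re.escape(RESULT_MARKER) + r"\s*([1-5])\b", feedback)
    return int(match.group(1)) if match else None


def parse_relative(feedback: str) -> Optional[str]:
    """Relative grading: two completions for one prompt, pick the better.

    Returns "A" or "B" (the preferred completion), or None if the marker
    is missing or the verdict is neither of the two labels.
    """
    match = re.search(re.escape(RESULT_MARKER) + r"\s*([AB])\b", feedback)
    return match.group(1) if match else None
```

Something like `parse_absolute("The answer is accurate and complete. [RESULT] 4")` would give the score for a single pair, while `parse_relative(...)` would give the preferred completion of two. Returning `None` instead of raising keeps a batch run going when a generation happens to be malformed.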