# Testing (might be broken)
Another trial of merging models of different sizes. It is still under testing; this approach should be more stable, but I have no idea yet whether it improves or degrades the base model.
Recipe:

```yaml
merge_method: task_anysize
base_model: princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT
models:
  - model: senseable/WestLake-7B-v2
    parameters:
      weight: 1.0
dtype: bfloat16
```
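Conceptually, task-style merging adds a weighted "task vector" (the fine-tuned weights minus the base weights) back onto the base model. The sketch below illustrates that idea on toy tensors, including a naive way to handle mismatched shapes by merging only the overlapping slice. This is NOT mergekit's actual `task_anysize` implementation; the function name and the truncation strategy are assumptions for illustration only.

```python
import numpy as np

def merge_task_anysize(base: np.ndarray, finetuned: np.ndarray,
                       weight: float = 1.0) -> np.ndarray:
    """Illustrative task-arithmetic merge for tensors of different sizes.

    The overlapping region of the two tensors is merged via
    base + weight * (finetuned - base); the rest of the base tensor
    is kept unchanged. (Hypothetical sketch, not mergekit's method.)
    """
    merged = base.copy()
    # Slice covering the region common to both tensors.
    overlap = tuple(slice(0, min(b, f))
                    for b, f in zip(base.shape, finetuned.shape))
    delta = finetuned[overlap] - base[overlap]        # task vector on the overlap
    merged[overlap] = base[overlap] + weight * delta  # apply scaled task vector
    return merged

# Toy example: a small "base" tensor and a larger "donor" tensor.
base = np.zeros((2, 3))
donor = np.ones((4, 4))
out = merge_task_anysize(base, donor, weight=1.0)
print(out.shape)  # the merged tensor keeps the base model's shape
```

With `weight: 1.0`, as in the recipe above, the overlapping region is replaced entirely by the donor's values; smaller weights would interpolate between the two.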
Detailed results can be found here
| Metric                            | Value |
|-----------------------------------|-------|
| Avg.                              | 41.16 |
| AI2 Reasoning Challenge (25-Shot) | 39.76 |
| HellaSwag (10-Shot)               | 70.33 |
| MMLU (5-Shot)                     | 26.81 |
| TruthfulQA (0-shot)               | 46.50 |
| Winogrande (5-shot)               | 63.54 |
| GSM8k (5-shot)                    |  0.00 |