---
datasets:
- xzuyn/chatdoctor-200k-stripped
- Technoculture/riddle_sense
- axiong/pmc_llama_instructions
- Open-Orca/SlimOrca-Dedup
language:
- en
tags:
- medical
---
![image/png](https://cdn-uploads.huggingface.co/production/uploads/63486df1f8f01fcc4b23e97d/nMuS3Qnb5m0dENIixWv0q.png)
This model is the [Technoculture/MT7Bi-alpha](https://huggingface.co/Technoculture/MT7Bi-alpha) adapter merged with its base model, [Meditron 7B](https://huggingface.co/epfl-llm/meditron-7b).
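For reference, here is a minimal sketch of how such an adapter merge is typically produced with Hugging Face `peft`, and of loading this merged repo directly for inference. The repo ids come from this card; the dtype and other settings are assumptions, not the exact recipe used here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model (Meditron 7B) and fold the MT7Bi-alpha adapter into it.
base = AutoModelForCausalLM.from_pretrained(
    "epfl-llm/meditron-7b",  # base model; dtype below is an assumption
    torch_dtype=torch.float16,
)
merged = PeftModel.from_pretrained(base, "Technoculture/MT7Bi-alpha")
merged = merged.merge_and_unload()  # bake the adapter weights into the base weights

# This repo already hosts the merged result, so for inference it can be
# loaded directly:
tokenizer = AutoTokenizer.from_pretrained("Technoculture/MT7Bi-sft")
model = AutoModelForCausalLM.from_pretrained(
    "Technoculture/MT7Bi-sft", torch_dtype=torch.float16
)
```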
# Evaluations
## Open LLM Leaderboard
| Model | ARC |HellaSwag|TruthfulQA|Winogrande|GSM8K|
|---------------------------------------------------|----:|--------:|---------:|---------:|----:|
|[MT7Bi-sft (epoch 4)](https://huggingface.co/Technoculture/MT7Bi-sft)|54.1| 75.11| 43.08| 72.14|15.54|
|[MT7Bi-sft (epoch 1)](https://huggingface.co/Technoculture/MT7Bi)|50.94| 73.24| 43.04| 72.06|22.52|
### Model Evaluation Benchmark
|Category  |MT7Bi|meditron-70b|llama-2-70b|med42-70b*|meditron-7b|llama-2-7b|PMC-llama-7b|
|----------|----:|-----------:|----------:|---------:|----------:|---------:|-----------:|
|Health    |     |        81.8|       69.1|      83.6|       27.3|      16.4|         3.6|
|Nutrition |     |        77.9|       68.8|      62.5|       31.1|      12.5|         6.3|
|Psychology|     |        47.4|       36.8|      52.6|       21.1|      10.5|         0.0|
|Science   |     |        77.8|       44.4|      33.3|       33.3|      11.1|         0.0|
|Avg       |     |        71.2|       54.8|      58.0|       28.3|      12.6|         2.5|
|Dataset       |MT7Bi|meditron-70b|llama-2-70b|med42-70b*|clinical-camel-70b*|
|--------------|----:|-----------:|----------:|---------:|------------------:|
|MMLU-Medical  | 46.9|        77.6|       77.9|      74.5|               65.7|
|PubMedQA      | 65.2|        81.6|       80.0|      61.2|               67.0|
|MedMCQA       | 42.7|        66.0|       62.6|      59.2|               46.7|
|MedQA         |     |        64.4|       61.5|      59.1|               50.8|
|MedQA-4-Option| 44.3|        70.2|       63.8|      63.9|               56.8|
|Avg           |     |        72.0|       69.2|      63.6|               57.4|
|Dataset       |meditron-7b|llama-2-7b|pmc-llama-7b|Zephyr-7B-beta*|Mistral-7B-instruct*|MT7Bi|
|--------------|----------:|---------:|-----------:|--------------:|-------------------:|----:|
|MMLU-Medical  |       54.2|      53.7|        56.4|           63.3|                60.0| 46.9|
|PubMedQA      |       74.4|      61.8|        59.2|           46.0|                17.8| 65.2|
|MedMCQA       |       59.2|      54.4|        57.6|           43.0|                40.2| 42.7|
|MedQA         |       47.9|      44.0|        42.4|           42.8|                32.4|     |
|MedQA-4-Option|       52.0|      49.6|        49.2|           48.5|                41.1| 44.3|
|Avg           |       57.5|      52.7|        53.0|           48.7|                38.3|     |
| Model Name | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
| ------------------ | -------- | --------- | ---- | ---------- | ---------- | -------- |
| Orca-2-7b | **78.4** | 76.1 | 53.7 | **52.4** | **74.2** | **47.2** |
| LLAMA-2-7b | 43.2 | **77.1** | 44.4 | 38.7 | 69.5 | 16 |
| MT7Bi-sft | 54.1 | 75.11 | - | 43.08 | 72.14 | 15.54 |
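The per-task breakdowns below are EleutherAI lm-evaluation-harness output. As a hedged reproduction sketch, something like the following should regenerate them via the harness's Python API; the task list is taken from the sections below, while few-shot counts and other settings are assumptions (the card does not state them):

```python
# Sketch only: assumes EleutherAI lm-evaluation-harness v0.4+ (pip install lm-eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Technoculture/MT7Bi-sft,dtype=float16",
    tasks=["arc_challenge", "hellaswag", "truthfulqa", "winogrande", "gsm8k"],
    batch_size="auto",
)
for task, metrics in results["results"].items():
    print(task, metrics)
```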
### ARC: 54.1%
|    Task     |Version|Metric       |Value|   |Stderr|
|-------------|------:|-------------|----:|---|-----:|
|arc_challenge|      1|acc,none     | 0.51|±  |  0.01|
|             |       |acc_norm,none| 0.54|±  |  0.01|
### HellaSwag: 75.11%
|  Task   |Version|Metric       |Value|   |Stderr|
|---------|------:|-------------|----:|---|-----:|
|hellaswag|      1|acc,none     | 0.57|±  |  0.00|
|         |       |acc_norm,none| 0.75|±  |  0.00|
### TruthfulQA: 43.08%
|     Task     |Version|Metric          |Value|   |Stderr|
|--------------|------:|----------------|----:|---|-----:|
|truthfulqa    |    N/A|bleu_max,none   |18.31|±  |  0.46|
|              |       |bleu_acc,none   | 0.39|±  |  0.00|
|              |       |bleu_diff,none  |-1.63|±  |  0.39|
|              |       |rouge1_max,none |41.99|±  |  0.71|
|              |       |rouge1_acc,none | 0.39|±  |  0.00|
|              |       |rouge1_diff,none|-2.88|±  |  0.66|
|              |       |rouge2_max,none |27.42|±  |  0.80|
|              |       |rouge2_acc,none | 0.32|±  |  0.00|
|              |       |rouge2_diff,none|-3.11|±  |  0.78|
|              |       |rougeL_max,none |38.81|±  |  0.71|
|              |       |rougeL_acc,none | 0.38|±  |  0.00|
|              |       |rougeL_diff,none|-3.01|±  |  0.66|
|              |       |acc,none        | 0.33|±  |  0.05|
|truthfulqa_gen|      3|bleu_max,none   |18.31|±  |  0.68|
|              |       |bleu_acc,none   | 0.39|±  |  0.02|
|              |       |bleu_diff,none  |-1.63|±  |  0.62|
|              |       |rouge1_max,none |41.99|±  |  0.84|
|              |       |rouge1_acc,none | 0.39|±  |  0.02|
|              |       |rouge1_diff,none|-2.88|±  |  0.81|
|              |       |rouge2_max,none |27.42|±  |  0.89|
|              |       |rouge2_acc,none | 0.32|±  |  0.02|
|              |       |rouge2_diff,none|-3.11|±  |  0.88|
|              |       |rougeL_max,none |38.81|±  |  0.84|
|              |       |rougeL_acc,none | 0.38|±  |  0.02|
|              |       |rougeL_diff,none|-3.01|±  |  0.82|
|truthfulqa_mc1|      2|acc,none        | 0.28|±  |  0.02|
|truthfulqa_mc2|      2|acc,none        | 0.43|±  |  0.01|
### Winogrande: 72.14%
|   Task   |Version|Metric  |Value|   |Stderr|
|----------|------:|--------|----:|---|-----:|
|winogrande|      1|acc,none| 0.72|±  |  0.01|
### GSM8K: 15.54%
|Task |Version|Metric                |Value|   |Stderr|
|-----|------:|----------------------|----:|---|-----:|
|gsm8k|      2|exact_match,get-answer| 0.16|±  |  0.01|
Elapsed time: 04:06:36