---
datasets:
- xzuyn/chatdoctor-200k-stripped
- Technoculture/riddle_sense
- axiong/pmc_llama_instructions
- Open-Orca/SlimOrca-Dedup
language:
- en
tags:
- medical
---


![image/png](https://cdn-uploads.huggingface.co/production/uploads/63486df1f8f01fcc4b23e97d/nMuS3Qnb5m0dENIixWv0q.png)

This model is the [Technoculture/MT7Bi-alpha](https://huggingface.co/Technoculture/MT7Bi-alpha) adapter merged into its base model, Meditron-7B.
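
A minimal sketch of how such an adapter merge might be reproduced with `peft` (the base-model repo id, dtype, and output path are assumptions; the merged weights published in this repository are the authoritative artifact):

```python
# Hypothetical sketch: fold the MT7Bi-alpha LoRA adapter into the Meditron-7B base weights.
# Repo ids and dtype below are assumptions, not taken from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "epfl-llm/meditron-7b"           # assumed base-model repo id
ADAPTER = "Technoculture/MT7Bi-alpha"   # adapter referenced above

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Attach the adapter, then merge its weights into the base model and drop the adapter wrappers.
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()
merged.save_pretrained("MT7Bi-sft-merged")
tokenizer.save_pretrained("MT7Bi-sft-merged")
```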

# Evaluations

## Open LLM Leaderboard
|                       Model                       | ARC |HellaSwag|TruthfulQA|Winogrande|GSM8K|
|---------------------------------------------------|----:|--------:|---------:|---------:|----:|
|[MT7Bi-sft (epoch 4)](https://huggingface.co/Technoculture/MT7Bi-sft)|54.1|    75.11| 43.08|     72.14|15.54|
|[MT7Bi-sft (epoch 1)](https://huggingface.co/Technoculture/MT7Bi)|50.94|    73.24| 43.04|     72.06|22.52|

## Model Evaluation Benchmark

|Category  | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | meditron-7b | llama-2-7b | PMC-llama-7b |
|----------|-------|--------------|-------------|------------|-------------|------------|--------------|
|Health    |       | 81.8         | 69.1        | 83.6       | 27.3        | 16.4       | 3.6          |
|Nutrition |       | 77.9         | 68.8        | 62.5       | 31.1        | 12.5       | 6.3          |
|Psychology|       | 47.4         | 36.8        | 52.6       | 21.1        | 10.5       | 0.0          |
|Science   |       | 77.8         | 44.4        | 33.3       | 33.3        | 11.1       | 0.0          |
|Avg       |       | 71.2         | 54.8        | 58.0       | 28.3        | 12.6       | 2.5          |


|Dataset       | MT7Bi | meditron-70b | llama-2-70b | med42-70b* | clinical-camel-70b* |
|--------------|-------|--------------|-------------|------------|---------------------|
|MMLU-Medical  | 46.9  | 77.6         | 77.9        | 74.5       | 65.7                |
|PubMedQA      | 65.2  | 81.6         | 80.0        | 61.2       | 67.0                |
|MedMCQA       | 42.7  | 66.0         | 62.6        | 59.2       | 46.7                |
|MedQA         |       | 64.4         | 61.5        | 59.1       | 50.8                |
|MedQA-4-Option| 44.3  | 70.2         | 63.8        | 63.9       | 56.8                |
|Avg           |       | 72.0         | 69.2        | 63.6       | 57.4                |

|Dataset       | meditron-7b | llama-2-7b | pmc-llama-7b | Zephyr-7B-beta* | Mistral-7B-instruct* | MT7Bi |
|--------------|-------------|------------|--------------|-----------------|----------------------|-------|
|MMLU-Medical  | 54.2        | 53.7       | 56.4         | 63.3            | 60.0                 | 46.9  |
|PubMedQA      | 74.4        | 61.8       | 59.2         | 46.0            | 17.8                 | 65.2  |
|MedMCQA       | 59.2        | 54.4       | 57.6         | 43.0            | 40.2                 | 42.7  |
|MedQA         | 47.9        | 44.0       | 42.4         | 42.8            | 32.4                 |       |
|MedQA-4-Option| 52.0        | 49.6       | 49.2         | 48.5            | 41.1                 | 44.3  |
|Avg           | 57.5        | 52.7       | 53.0         | 48.7            | 38.3                 |       |


| Model Name         | ARC      | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K    |
| ------------------ | -------- | --------- | ---- | ---------- | ---------- | -------- |
| Orca-2-7b          | **78.4** | 76.1      | 53.7 | **52.4**   | **74.2**   | **47.2** |
| LLAMA-2-7b         | 43.2     | **77.1**  | 44.4 | 38.7       | 69.5       | 16       |
| MT7Bi-sft    | 54.1    | 75.11     | -    | 43.08      | 72.14      | 15.54    |
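
The per-task tables below are raw lm-evaluation-harness outputs. A hedged sketch of how one such run might be reproduced with the harness's Python API (task selection, few-shot counts, and batch size follow the Open LLM Leaderboard convention and are assumptions, not a record of the exact command used here):

```python
# Hypothetical reproduction sketch using lm-evaluation-harness (pip install lm-eval).
# Settings below are assumed, not taken from this card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Technoculture/MT7Bi-sft,dtype=float16",
    tasks=["arc_challenge"],  # repeat for hellaswag, truthfulqa, winogrande, gsm8k
    num_fewshot=25,           # ARC is scored 25-shot on the leaderboard
    batch_size=8,
)
print(results["results"]["arc_challenge"])  # acc / acc_norm, as in the ARC table below
```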

### ARC: 54.1%
|    Task     |Version|       Metric       |    Value    |   |Stderr|
|-------------|------:|--------------------|-------------|---|------|
|arc_challenge|      1|acc,none            |         0.51|   |      |
|             |       |acc_stderr,none     |         0.01|   |      |
|             |       |acc_norm,none       |         0.54|   |      |
|             |       |acc_norm_stderr,none|         0.01|   |      |
|             |       |alias               |arc_challenge|   |      |

### HellaSwag: 75.11%
|  Task   |Version|       Metric       |  Value  |   |Stderr|
|---------|------:|--------------------|---------|---|------|
|hellaswag|      1|acc,none            |     0.57|   |      |
|         |       |acc_stderr,none     |        0|   |      |
|         |       |acc_norm,none       |     0.75|   |      |
|         |       |acc_norm_stderr,none|        0|   |      |
|         |       |alias               |hellaswag|   |      |

### TruthfulQA: 43.08%
|     Task     |Version|        Metric         |      Value      |   |Stderr|
|--------------|-------|-----------------------|-----------------|---|------|
|truthfulqa    |N/A    |bleu_max,none          |            18.31|   |      |
|              |       |bleu_max_stderr,none   |             0.46|   |      |
|              |       |bleu_acc,none          |             0.39|   |      |
|              |       |bleu_acc_stderr,none   |                0|   |      |
|              |       |bleu_diff,none         |            -1.63|   |      |
|              |       |bleu_diff_stderr,none  |             0.39|   |      |
|              |       |rouge1_max,none        |            41.99|   |      |
|              |       |rouge1_max_stderr,none |             0.71|   |      |
|              |       |rouge1_acc,none        |             0.39|   |      |
|              |       |rouge1_acc_stderr,none |                0|   |      |
|              |       |rouge1_diff,none       |            -2.88|   |      |
|              |       |rouge1_diff_stderr,none|             0.66|   |      |
|              |       |rouge2_max,none        |            27.42|   |      |
|              |       |rouge2_max_stderr,none |             0.80|   |      |
|              |       |rouge2_acc,none        |             0.32|   |      |
|              |       |rouge2_acc_stderr,none |                0|   |      |
|              |       |rouge2_diff,none       |            -3.11|   |      |
|              |       |rouge2_diff_stderr,none|             0.78|   |      |
|              |       |rougeL_max,none        |            38.81|   |      |
|              |       |rougeL_max_stderr,none |             0.71|   |      |
|              |       |rougeL_acc,none        |             0.38|   |      |
|              |       |rougeL_acc_stderr,none |                0|   |      |
|              |       |rougeL_diff,none       |            -3.01|   |      |
|              |       |rougeL_diff_stderr,none|             0.66|   |      |
|              |       |acc,none               |             0.33|   |      |
|              |       |acc_stderr,none        |             0.05|   |      |
|              |       |alias                  |truthfulqa       |   |      |
|truthfulqa_gen|      3|bleu_max,none          |            18.31|   |      |
|              |       |bleu_max_stderr,none   |             0.68|   |      |
|              |       |bleu_acc,none          |             0.39|   |      |
|              |       |bleu_acc_stderr,none   |             0.02|   |      |
|              |       |bleu_diff,none         |            -1.63|   |      |
|              |       |bleu_diff_stderr,none  |             0.62|   |      |
|              |       |rouge1_max,none        |            41.99|   |      |
|              |       |rouge1_max_stderr,none |             0.84|   |      |
|              |       |rouge1_acc,none        |             0.39|   |      |
|              |       |rouge1_acc_stderr,none |             0.02|   |      |
|              |       |rouge1_diff,none       |            -2.88|   |      |
|              |       |rouge1_diff_stderr,none|             0.81|   |      |
|              |       |rouge2_max,none        |            27.42|   |      |
|              |       |rouge2_max_stderr,none |             0.89|   |      |
|              |       |rouge2_acc,none        |             0.32|   |      |
|              |       |rouge2_acc_stderr,none |             0.02|   |      |
|              |       |rouge2_diff,none       |            -3.11|   |      |
|              |       |rouge2_diff_stderr,none|             0.88|   |      |
|              |       |rougeL_max,none        |            38.81|   |      |
|              |       |rougeL_max_stderr,none |             0.84|   |      |
|              |       |rougeL_acc,none        |             0.38|   |      |
|              |       |rougeL_acc_stderr,none |             0.02|   |      |
|              |       |rougeL_diff,none       |            -3.01|   |      |
|              |       |rougeL_diff_stderr,none|             0.82|   |      |
|              |       |alias                  | - truthfulqa_gen|   |      |
|truthfulqa_mc1|      2|acc,none               |             0.28|   |      |
|              |       |acc_stderr,none        |             0.02|   |      |
|              |       |alias                  | - truthfulqa_mc1|   |      |
|truthfulqa_mc2|      2|acc,none               |             0.43|   |      |
|              |       |acc_stderr,none        |             0.01|   |      |
|              |       |alias                  | - truthfulqa_mc2|   |      |

### Winogrande: 72.14%
|   Task   |Version|    Metric     |  Value   |   |Stderr|
|----------|------:|---------------|----------|---|------|
|winogrande|      1|acc,none       |      0.72|   |      |
|          |       |acc_stderr,none|      0.01|   |      |
|          |       |alias          |winogrande|   |      |

### GSM8K: 15.54%
|Task |Version|           Metric            |Value|   |Stderr|
|-----|------:|-----------------------------|-----|---|------|
|gsm8k|      2|exact_match,get-answer       | 0.16|   |      |
|     |       |exact_match_stderr,get-answer| 0.01|   |      |
|     |       |alias                        |gsm8k|   |      |

Total evaluation elapsed time: 04:06:36