macadeliccc committed
Commit 49ede63 • 1 Parent(s): 999bee5
Update README.md
README.md CHANGED
@@ -36,4 +36,98 @@ TODO

## Evaluations

## EQ-Bench

<pre>----Benchmark Complete----
2024-01-30 15:22:18
Time taken: 145.9 mins
Prompt Format: ChatML
Model: macadeliccc/MBX-7B-v3-DPO
Score (v2): 74.32
Parseable: 166.0
---------------
Batch completed
Time taken: 145.9 mins
---------------
</pre>

|                                                              Model                              |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|-----------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[MBX-7B-v3-DPO](https://huggingface.co/macadeliccc/MBX-7B-v3-DPO)|  45.16|  77.73|     74.62|   48.83|  61.58|

### AGIEval
|             Task             |Version| Metric |Value|   |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |27.95|±  |  2.82|
|                              |       |acc_norm|26.77|±  |  2.78|
|agieval_logiqa_en             |      0|acc     |41.01|±  |  1.93|
|                              |       |acc_norm|40.55|±  |  1.93|
|agieval_lsat_ar               |      0|acc     |25.65|±  |  2.89|
|                              |       |acc_norm|23.91|±  |  2.82|
|agieval_lsat_lr               |      0|acc     |50.78|±  |  2.22|
|                              |       |acc_norm|52.94|±  |  2.21|
|agieval_lsat_rc               |      0|acc     |66.54|±  |  2.88|
|                              |       |acc_norm|65.80|±  |  2.90|
|agieval_sat_en                |      0|acc     |77.67|±  |  2.91|
|                              |       |acc_norm|77.67|±  |  2.91|
|agieval_sat_en_without_passage|      0|acc     |43.20|±  |  3.46|
|                              |       |acc_norm|43.20|±  |  3.46|
|agieval_sat_math              |      0|acc     |32.27|±  |  3.16|
|                              |       |acc_norm|30.45|±  |  3.11|

Average: 45.16%
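
The 45.16% figure is consistent with an unweighted mean of the eight `acc_norm` values in the AGIEval table. The source does not state the averaging rule, so the following is only a quick check under that assumption; the variable names are illustrative.

```python
# acc_norm values copied from the AGIEval table above.
# Assumption: the reported average is the unweighted mean of acc_norm.
agieval_acc_norm = {
    "agieval_aqua_rat": 26.77,
    "agieval_logiqa_en": 40.55,
    "agieval_lsat_ar": 23.91,
    "agieval_lsat_lr": 52.94,
    "agieval_lsat_rc": 65.80,
    "agieval_sat_en": 77.67,
    "agieval_sat_en_without_passage": 43.20,
    "agieval_sat_math": 30.45,
}

values = list(agieval_acc_norm.values())
agieval_average = round(sum(values) / len(values), 2)
print(f"AGIEval average: {agieval_average:.2f}%")
```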

### GPT4All
|    Task     |Version| Metric |Value|   |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge|      0|acc     |68.43|±  |  1.36|
|             |       |acc_norm|68.34|±  |  1.36|
|arc_easy     |      0|acc     |87.54|±  |  0.68|
|             |       |acc_norm|82.11|±  |  0.79|
|boolq        |      1|acc     |88.20|±  |  0.56|
|hellaswag    |      0|acc     |69.76|±  |  0.46|
|             |       |acc_norm|87.40|±  |  0.33|
|openbookqa   |      0|acc     |40.20|±  |  2.19|
|             |       |acc_norm|49.60|±  |  2.24|
|piqa         |      0|acc     |83.68|±  |  0.86|
|             |       |acc_norm|85.36|±  |  0.82|
|winogrande   |      0|acc     |83.11|±  |  1.05|

Average: 77.73%
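
Two GPT4All tasks (`boolq`, `winogrande`) report only `acc`. The 77.73% figure matches a mean that takes `acc_norm` where it exists and falls back to `acc` otherwise. This rule is inferred from the numbers, not stated in the source, so treat the sketch below as an assumption-laden check rather than the evaluation tool's actual logic.

```python
# Values copied from the GPT4All table above.
# Assumption: prefer acc_norm, fall back to acc when acc_norm is absent.
gpt4all = {
    "arc_challenge": {"acc": 68.43, "acc_norm": 68.34},
    "arc_easy": {"acc": 87.54, "acc_norm": 82.11},
    "boolq": {"acc": 88.20},
    "hellaswag": {"acc": 69.76, "acc_norm": 87.40},
    "openbookqa": {"acc": 40.20, "acc_norm": 49.60},
    "piqa": {"acc": 83.68, "acc_norm": 85.36},
    "winogrande": {"acc": 83.11},
}

scores = [metrics.get("acc_norm", metrics["acc"]) for metrics in gpt4all.values()]
gpt4all_average = round(sum(scores) / len(scores), 2)
print(f"GPT4All average: {gpt4all_average:.2f}%")
```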

### TruthfulQA
|    Task     |Version|Metric|Value|   |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc|      1|mc1   |58.87|±  |  1.72|
|             |       |mc2   |74.62|±  |  1.44|

Average: 74.62%

### Bigbench
|                      Task                      |Version|       Metric        |Value|   |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                       |      0|multiple_choice_grade|60.00|±  |  3.56|
|bigbench_date_understanding                     |      0|multiple_choice_grade|63.14|±  |  2.51|
|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|47.67|±  |  3.12|
|bigbench_geometric_shapes                       |      0|multiple_choice_grade|22.56|±  |  2.21|
|                                                |       |exact_str_match      | 0.84|±  |  0.48|
|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|33.20|±  |  2.11|
|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|23.00|±  |  1.59|
|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|59.67|±  |  2.84|
|bigbench_movie_recommendation                   |      0|multiple_choice_grade|47.40|±  |  2.24|
|bigbench_navigate                               |      0|multiple_choice_grade|56.10|±  |  1.57|
|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|71.25|±  |  1.01|
|bigbench_ruin_names                             |      0|multiple_choice_grade|56.47|±  |  2.35|
|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|35.27|±  |  1.51|
|bigbench_snarks                                 |      0|multiple_choice_grade|73.48|±  |  3.29|
|bigbench_sports_understanding                   |      0|multiple_choice_grade|75.46|±  |  1.37|
|bigbench_temporal_sequences                     |      0|multiple_choice_grade|52.10|±  |  1.58|
|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|22.64|±  |  1.18|
|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|19.83|±  |  0.95|
|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|59.67|±  |  2.84|

Average: 48.83%
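
The Bigbench table mixes two metrics: every task reports `multiple_choice_grade`, and `bigbench_geometric_shapes` also reports `exact_str_match`. The 48.83% figure matches the mean of the 18 `multiple_choice_grade` values with the `exact_str_match` row left out. That exclusion is inferred from the arithmetic, so the check below is a sketch under that assumption.

```python
# multiple_choice_grade values copied from the Bigbench table above, in row order.
# Assumption: the exact_str_match row (0.84) is not part of the average.
multiple_choice_grade = [
    60.00, 63.14, 47.67, 22.56, 33.20, 23.00,
    59.67, 47.40, 56.10, 71.25, 56.47, 35.27,
    73.48, 75.46, 52.10, 22.64, 19.83, 59.67,
]

bigbench_average = round(sum(multiple_choice_grade) / len(multiple_choice_grade), 2)
print(f"Bigbench average: {bigbench_average:.2f}%")
```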

Average score: 61.58%

Elapsed time: 02:37:39
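
The overall average score of 61.58% reported above lines up with the plain mean of the four suite averages, which works out to 61.585 when computed from the rounded per-suite figures. How the original tool rounded the final value is not stated, so this sketch compares within a small tolerance instead of asserting an exact rounding rule.

```python
# Per-suite averages copied from the summary table.
# Assumption: the overall score is the unweighted mean of the four suites.
suite_averages = {
    "AGIEval": 45.16,
    "GPT4All": 77.73,
    "TruthfulQA": 74.62,
    "Bigbench": 48.83,
}

overall = sum(suite_averages.values()) / len(suite_averages)
reported = 61.58

# Compare with a tolerance because the inputs are already rounded to 2 decimals.
matches_reported = abs(overall - reported) < 0.01
print(f"Mean of suite averages: {overall:.3f}% (reported: {reported:.2f}%)")
```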