macadeliccc committed
Commit 49ede63
1 Parent(s): 999bee5

Update README.md

Files changed (1)
  1. README.md +95 -1
README.md CHANGED
@@ -36,4 +36,98 @@ TODO
  
  ## Evaluations
  
- TODO
+ ## EQ-Bench
+
+ <pre>----Benchmark Complete----
+ 2024-01-30 15:22:18
+ Time taken: 145.9 mins
+ Prompt Format: ChatML
+ Model: macadeliccc/MBX-7B-v3-DPO
+ Score (v2): 74.32
+ Parseable: 166.0
+ ---------------
+ Batch completed
+ Time taken: 145.9 mins
+ ---------------
+ </pre>
+
+ |                              Model                              |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
+ |-----------------------------------------------------------------|------:|------:|---------:|-------:|------:|
+ |[MBX-7B-v3-DPO](https://huggingface.co/macadeliccc/MBX-7B-v3-DPO)|  45.16|  77.73|     74.62|   48.83|  61.58|
+
+ ### AGIEval
+ |             Task             |Version| Metric |Value|   |Stderr|
+ |------------------------------|------:|--------|----:|---|-----:|
+ |agieval_aqua_rat              |      0|acc     |27.95|±  |  2.82|
+ |                              |       |acc_norm|26.77|±  |  2.78|
+ |agieval_logiqa_en             |      0|acc     |41.01|±  |  1.93|
+ |                              |       |acc_norm|40.55|±  |  1.93|
+ |agieval_lsat_ar               |      0|acc     |25.65|±  |  2.89|
+ |                              |       |acc_norm|23.91|±  |  2.82|
+ |agieval_lsat_lr               |      0|acc     |50.78|±  |  2.22|
+ |                              |       |acc_norm|52.94|±  |  2.21|
+ |agieval_lsat_rc               |      0|acc     |66.54|±  |  2.88|
+ |                              |       |acc_norm|65.80|±  |  2.90|
+ |agieval_sat_en                |      0|acc     |77.67|±  |  2.91|
+ |                              |       |acc_norm|77.67|±  |  2.91|
+ |agieval_sat_en_without_passage|      0|acc     |43.20|±  |  3.46|
+ |                              |       |acc_norm|43.20|±  |  3.46|
+ |agieval_sat_math              |      0|acc     |32.27|±  |  3.16|
+ |                              |       |acc_norm|30.45|±  |  3.11|
+
+ Average: 45.16%
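
Reviewer note (not part of the commit): the reported AGIEval average appears to be the unweighted mean of the `acc_norm` rows. This is inferred from the numbers in the table, not from any documented rule of the evaluation harness, and the names `agieval_acc_norm` and `suite_average` are made up for illustration. A minimal sketch of that check:

```python
# acc_norm values copied from the AGIEval table in this diff (percent).
agieval_acc_norm = {
    "agieval_aqua_rat": 26.77,
    "agieval_logiqa_en": 40.55,
    "agieval_lsat_ar": 23.91,
    "agieval_lsat_lr": 52.94,
    "agieval_lsat_rc": 65.80,
    "agieval_sat_en": 77.67,
    "agieval_sat_en_without_passage": 43.20,
    "agieval_sat_math": 30.45,
}


def suite_average(scores):
    """Unweighted arithmetic mean of per-task scores (hypothetical helper)."""
    return sum(scores.values()) / len(scores)


# Assumption: every task counts equally, regardless of its number of questions.
agieval_mean = suite_average(agieval_acc_norm)
print(f"AGIEval average: {agieval_mean:.2f}%")
```

Under that assumption the mean of the eight values is 45.16125, which rounds to the 45.16% shown in the README.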
+
+ ### GPT4All
+ |    Task     |Version| Metric |Value|   |Stderr|
+ |-------------|------:|--------|----:|---|-----:|
+ |arc_challenge|      0|acc     |68.43|±  |  1.36|
+ |             |       |acc_norm|68.34|±  |  1.36|
+ |arc_easy     |      0|acc     |87.54|±  |  0.68|
+ |             |       |acc_norm|82.11|±  |  0.79|
+ |boolq        |      1|acc     |88.20|±  |  0.56|
+ |hellaswag    |      0|acc     |69.76|±  |  0.46|
+ |             |       |acc_norm|87.40|±  |  0.33|
+ |openbookqa   |      0|acc     |40.20|±  |  2.19|
+ |             |       |acc_norm|49.60|±  |  2.24|
+ |piqa         |      0|acc     |83.68|±  |  0.86|
+ |             |       |acc_norm|85.36|±  |  0.82|
+ |winogrande   |      0|acc     |83.11|±  |  1.05|
+
+ Average: 77.73%
+
+ ### TruthfulQA
+ |    Task     |Version|Metric|Value|   |Stderr|
+ |-------------|------:|------|----:|---|-----:|
+ |truthfulqa_mc|      1|mc1   |58.87|±  |  1.72|
+ |             |       |mc2   |74.62|±  |  1.44|
+
+ Average: 74.62%
+
+ ### Bigbench
+ |                      Task                      |Version|       Metric        |Value|   |Stderr|
+ |------------------------------------------------|------:|---------------------|----:|---|-----:|
+ |bigbench_causal_judgement                       |      0|multiple_choice_grade|60.00|±  |  3.56|
+ |bigbench_date_understanding                     |      0|multiple_choice_grade|63.14|±  |  2.51|
+ |bigbench_disambiguation_qa                      |      0|multiple_choice_grade|47.67|±  |  3.12|
+ |bigbench_geometric_shapes                       |      0|multiple_choice_grade|22.56|±  |  2.21|
+ |                                                |       |exact_str_match      | 0.84|±  |  0.48|
+ |bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|33.20|±  |  2.11|
+ |bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|23.00|±  |  1.59|
+ |bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|59.67|±  |  2.84|
+ |bigbench_movie_recommendation                   |      0|multiple_choice_grade|47.40|±  |  2.24|
+ |bigbench_navigate                               |      0|multiple_choice_grade|56.10|±  |  1.57|
+ |bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|71.25|±  |  1.01|
+ |bigbench_ruin_names                             |      0|multiple_choice_grade|56.47|±  |  2.35|
+ |bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|35.27|±  |  1.51|
+ |bigbench_snarks                                 |      0|multiple_choice_grade|73.48|±  |  3.29|
+ |bigbench_sports_understanding                   |      0|multiple_choice_grade|75.46|±  |  1.37|
+ |bigbench_temporal_sequences                     |      0|multiple_choice_grade|52.10|±  |  1.58|
+ |bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|22.64|±  |  1.18|
+ |bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|19.83|±  |  0.95|
+ |bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|59.67|±  |  2.84|
+
+ Average: 48.83%
+
+ Average score: 61.58%
+
+ Elapsed time: 02:37:39
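
Reviewer note (not part of the commit): the overall "Average score" looks like the unweighted mean of the four suite averages. The sketch below checks that using the rounded values printed in the README. The variable names are hypothetical, and the equal weighting is inferred from the figures rather than from harness documentation. `Decimal` is used so the arithmetic on two-decimal values is exact.

```python
from decimal import Decimal

# Suite averages as printed in the README summary table (percent).
suite_averages = {
    "AGIEval": Decimal("45.16"),
    "GPT4All": Decimal("77.73"),
    "TruthfulQA": Decimal("74.62"),
    "Bigbench": Decimal("48.83"),
}

# Assumption: each suite contributes equally to the overall score.
overall = sum(suite_averages.values()) / len(suite_averages)

reported = Decimal("61.58")
gap = abs(overall - reported)

print(f"Mean of rounded suite averages: {overall}")
print(f"Gap to reported average score:  {gap}")
```

The mean of the rounded inputs is exactly 61.585, half a hundredth above the reported 61.58%. A gap of that size is what one would expect if the original script averaged unrounded suite scores before formatting, though the diff does not confirm this.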