Update README.md
README.md
CHANGED
@@ -94,6 +94,34 @@ The dataset is comprised of a mixture of open datasets large-scale datasets avai
 | Claude 2 | - |RLHF |8.06| 91.36|
 | GPT-4 | -| RLHF |8.99| 95.28|
 
+## Other benchmark:
+- BigBench: 0.3526.
+
+
+| Task | Version | Metric | Value | Stderr |
+|-----------------------------------------------------|---------|-------------------------|-------|--------|
+| bigbench_causal_judgement | 0 | multiple_choice_grade | 0.5316| 0.0363 |
+| bigbench_date_understanding | 0 | multiple_choice_grade | 0.4363| 0.0259 |
+| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 0.3217| 0.0291 |
+| bigbench_dyck_languages | 0 | multiple_choice_grade | 0.1450| 0.0111 |
+| bigbench_formal_fallacies_syllogisms_negation | 0 | multiple_choice_grade | 0.4982| 0.0042 |
+| bigbench_geometric_shapes | 0 | multiple_choice_grade | 0.1086| 0.0164 |
+| bigbench_hyperbaton | 0 | exact_str_match | 0.0000| 0.0000 |
+| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 0.5232| 0.0022 |
+| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 0.2480| 0.0193 |
+| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 0.1814| 0.0146 |
+| bigbench_movie_recommendation | 0 | multiple_choice_grade | 0.4067| 0.0284 |
+| bigbench_navigate | 0 | multiple_choice_grade | 0.2580| 0.0196 |
+| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 0.5990| 0.0155 |
+| bigbench_ruin_names | 0 | multiple_choice_grade | 0.4370| 0.0111 |
+| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 0.3951| 0.0231 |
+| bigbench_snarks | 0 | multiple_choice_grade | 0.2265| 0.0133 |
+| bigbench_sports_understanding | 0 | multiple_choice_grade | 0.6464| 0.0356 |
+| bigbench_temporal_sequences | 0 | multiple_choice_grade | 0.5091| 0.0159 |
+| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 0.2680| 0.0140 |
+| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 0.1856| 0.0110 |
+| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 0.1269| 0.0080 |
+
 ### Training Infrastructure
 
 * **Hardware**: `Stable Zephyr 3B` was trained on the Stability AI cluster across 8 nodes with 8 A100 80GBs GPUs for each nodes.
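Review note: the added `- BigBench: 0.3526.` line does not say how the aggregate is derived from the per-task table. One reading that matches the numbers is an unweighted mean over the 20 `multiple_choice_grade` rows, leaving out `bigbench_hyperbaton` (the only `exact_str_match` row). This is an assumption inferred from the table, not something the README states. The sketch below recomputes it; `ROWS` and `aggregate` are hypothetical names used only for illustration.

```python
from statistics import mean

# (task, metric, value) rows copied from the table added in this diff.
ROWS = [
    ("bigbench_causal_judgement", "multiple_choice_grade", 0.5316),
    ("bigbench_date_understanding", "multiple_choice_grade", 0.4363),
    ("bigbench_disambiguation_qa", "multiple_choice_grade", 0.3217),
    ("bigbench_dyck_languages", "multiple_choice_grade", 0.1450),
    ("bigbench_formal_fallacies_syllogisms_negation", "multiple_choice_grade", 0.4982),
    ("bigbench_geometric_shapes", "multiple_choice_grade", 0.1086),
    ("bigbench_hyperbaton", "exact_str_match", 0.0000),
    ("bigbench_logical_deduction_five_objects", "multiple_choice_grade", 0.5232),
    ("bigbench_logical_deduction_seven_objects", "multiple_choice_grade", 0.2480),
    ("bigbench_logical_deduction_three_objects", "multiple_choice_grade", 0.1814),
    ("bigbench_movie_recommendation", "multiple_choice_grade", 0.4067),
    ("bigbench_navigate", "multiple_choice_grade", 0.2580),
    ("bigbench_reasoning_about_colored_objects", "multiple_choice_grade", 0.5990),
    ("bigbench_ruin_names", "multiple_choice_grade", 0.4370),
    ("bigbench_salient_translation_error_detection", "multiple_choice_grade", 0.3951),
    ("bigbench_snarks", "multiple_choice_grade", 0.2265),
    ("bigbench_sports_understanding", "multiple_choice_grade", 0.6464),
    ("bigbench_temporal_sequences", "multiple_choice_grade", 0.5091),
    ("bigbench_tracking_shuffled_objects_five_objects", "multiple_choice_grade", 0.2680),
    ("bigbench_tracking_shuffled_objects_seven_objects", "multiple_choice_grade", 0.1856),
    ("bigbench_tracking_shuffled_objects_three_objects", "multiple_choice_grade", 0.1269),
]


def aggregate(rows, metric=None):
    """Unweighted mean of the Value column, optionally restricted to one metric."""
    values = [value for _, m, value in rows if metric is None or m == metric]
    return mean(values)


if __name__ == "__main__":
    # Assumed reading: average only the multiple_choice_grade rows.
    print(round(aggregate(ROWS, "multiple_choice_grade"), 4))  # 0.3526
    # Averaging all 21 rows gives a different figure.
    print(round(aggregate(ROWS), 4))  # 0.3358
```

If the aggregate is meant to cover all 21 tasks, the headline number would be 0.3358 instead, so it may be worth stating the averaging rule next to the score.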