Muennighoff committed
Commit • f6558ef
1 Parent(s): 3746e8e
Update README.md

README.md CHANGED
@@ -22,50 +22,6 @@ base_model: allenai/OLMoE-1B-7B-0924-SFT
- Paper:
- Logs: https://github.com/allenai/OLMoE/blob/main/logs/olmoe-dpo-logs.txt

- ### Evaluation Summary
-
- | Task (→) | MMLU | GSM8k | BBH | Human-Eval | Alpaca-Eval 1.0 | XSTest | IFEval | Avg |
- |---------------|------|-------|------|------------|-----------------|--------|--------|------|
- | **Setup (→)** | 0-shot | 8-shot CoT | 3-shot | 0-shot | 0-shot | 0-shot | 0-shot | |
- | **Metric (→)** | EM | EM | EM | Pass@10 | %win | F1 | Loose Acc | |
- | | | | | | | | | |
- | OLMo-1B (0724) | 25.0 | 7.0 | 22.5 | 16.0 | - | 67.6 | 20.5 | - |
- | +SFT | 36.0 | 12.5 | 27.2 | 21.2 | 41.5 | 81.9 | 26.1 | 35.9 |
- | +DPO | 36.7 | 12.5 | 30.6 | 22.0 | 50.9 | 79.8 | 24.2 | 37.4 |
- | OLMo-7B (0724) | 50.8 | 32.5 | 36.9 | 32.3 | - | 80.8 | 19.6 | - |
- | +SFT | 54.2 | 25.0 | 35.7 | 38.5 | 70.9 | 86.1 | 39.7 | 49.3 |
- | +DPO | 52.8 | 9.0 | 16.6 | 35.0 | 83.5 | **87.5** | 37.9 | 49.1 |
- | JetMoE-2B-9B | 45.6 | 43.0 | 37.2 | 54.6 | - | 68.2 | 20.0 | - |
- | +SFT | 46.1 | 53.5 | 35.6 | 64.8 | 69.3 | 55.6 | 30.5 | 50.4 |
- | DeepSeek-3B-16B | 37.7 | 18.5 | 39.4 | 48.3 | - | 65.9 | 13.5 | - |
- | +Chat | 48.5 | 46.5 | **40.8** | **70.1** | 74.8 | 85.6 | 32.3 | 57.0 |
- | Qwen1.5-3B-14B | **60.4** | 13.5 | 27.2 | 60.2 | - | 73.4 | 20.9 | - |
- | +Chat | 58.9 | **55.5** | 21.3 | 59.7 | 83.9 | 85.6 | 36.2 | 57.3 |
- | **OLMoE (This Model)** | 49.8 | 3.0 | 33.6 | 22.4 | - | 59.7 | 16.6 | - |
- | **+SFT** | 51.4 | 40.5 | 38.0 | 51.6 | 69.2 | 84.1 | 43.3 | 54.0 |
- | **+DPO** | 51.9 | 45.5 | 37.0 | 54.8 | **84.0** | 82.6 | **48.1** | **57.7** |
-
- ### Artifacts
-
- - **Pretraining**
-   - [Checkpoints](https://hf.co/allenai/OLMoE-1B-7B-0924)
-   - [Code](https://github.com/allenai/OLMo/tree/Muennighoff/MoE): Built on top of OLMo models.
-   - [Data](https://huggingface.co/datasets/allenai/OLMoE-mix-0924): Mix of DCLM Baseline with some components of Dolma.
-   - Logs: *coming soon*
- - **SFT (Supervised Fine-Tuning)**
-   - [Checkpoints](https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT): With and without load balancing.
-   - [Code](https://github.com/allenai/open-instruct/tree/olmoe-sft)
-   - [Data](https://hf.co/datasets/allenai/tulu-v3.1-mix-preview-4096-OLMoE): Preview of the Tulu 3 post-training recipe.
-   - [Logs](https://github.com/allenai/OLMoE/blob/main/logs/olmoe-sft-logs.txt)
- - **DPO/KTO (Direct Preference Optimization/Kahneman-Tversky Optimization)**
-   - [Checkpoints](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct)
-   - [Preference Data](https://hf.co/datasets/allenai/ultrafeedback_binarized_cleaned)
-   - [DPO code](https://github.com/allenai/open-instruct/tree/olmoe-sft), [KTO code](https://github.com/Muennighoff/kto/blob/master/kto.py)
-   - [Logs](https://github.com/allenai/OLMoE/blob/main/logs/olmoe-dpo-logs.txt)
-
-

# Use

Install `transformers` **from source** (until a release after [this PR](https://github.com/huggingface/transformers/pull/32406)) and `torch`, then run:
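The snippet itself sits outside this hunk's context. As a hedged illustration only (not the model card's own example, which may use a model-specific class added by the linked PR), a minimal generation sketch with the generic `transformers` chat API looks like this:

```python
# Illustrative sketch, not the model card's exact snippet. Assumes `transformers`
# installed from source (per the note above) plus `torch`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMoE-1B-7B-0924-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Format a single-turn chat prompt with the checkpoint's chat template, then generate.
messages = [{"role": "user", "content": "Summarize mixture-of-experts in one sentence."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```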
@@ -99,6 +55,29 @@ Branches:
- `non-annealed`: Ablation starting from the `non-annealed` branch of https://hf.co/allenai/OLMoE-1B-7B-0924-SFT, which is an SFT of the pretraining checkpoint prior to annealing (branch `step1200000-tokens5033B` of https://hf.co/allenai/OLMoE-1B-7B-0924).
- `kto`: Ablation using KTO instead of DPO. This branch is the checkpoint after 5,000 steps with the RMS optimizer. The other `kto*` branches correspond to the other checkpoints mentioned in the paper.

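These branches are ordinary Hub revisions of this repository; as an illustrative sketch (the `revision` argument is the standard Hugging Face Hub mechanism, not something specific to this model card), an ablation branch can be loaded like so:

```python
# Illustrative: load one of the ablation branches listed above by Hub revision.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "allenai/OLMoE-1B-7B-0924-Instruct"
branch = "kto"  # or "non-annealed", or any of the other `kto*` branches

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=branch)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=branch)
print(f"Loaded {repo_id}@{branch}: {model.num_parameters():,} parameters")
```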
+ # Evaluation Snapshot
+
+ | Task (→) | MMLU | GSM8k | BBH | Human-Eval | Alpaca-Eval 1.0 | XSTest | IFEval | Avg |
+ |---------------|------|-------|------|------------|-----------------|--------|--------|------|
+ | **Setup (→)** | 0-shot | 8-shot CoT | 3-shot | 0-shot | 0-shot | 0-shot | 0-shot | |
+ | **Metric (→)** | EM | EM | EM | Pass@10 | %win | F1 | Loose Acc | |
+ | | | | | | | | | |
+ | OLMo-1B (0724) | 25.0 | 7.0 | 22.5 | 16.0 | - | 67.6 | 20.5 | - |
+ | +SFT | 36.0 | 12.5 | 27.2 | 21.2 | 41.5 | 81.9 | 26.1 | 35.9 |
+ | +DPO | 36.7 | 12.5 | 30.6 | 22.0 | 50.9 | 79.8 | 24.2 | 37.4 |
+ | OLMo-7B (0724) | 50.8 | 32.5 | 36.9 | 32.3 | - | 80.8 | 19.6 | - |
+ | +SFT | 54.2 | 25.0 | 35.7 | 38.5 | 70.9 | 86.1 | 39.7 | 49.3 |
+ | +DPO | 52.8 | 9.0 | 16.6 | 35.0 | 83.5 | **87.5** | 37.9 | 49.1 |
+ | JetMoE-2B-9B | 45.6 | 43.0 | 37.2 | 54.6 | - | 68.2 | 20.0 | - |
+ | +SFT | 46.1 | 53.5 | 35.6 | 64.8 | 69.3 | 55.6 | 30.5 | 50.4 |
+ | DeepSeek-3B-16B | 37.7 | 18.5 | 39.4 | 48.3 | - | 65.9 | 13.5 | - |
+ | +Chat | 48.5 | 46.5 | **40.8** | **70.1** | 74.8 | 85.6 | 32.3 | 57.0 |
+ | Qwen1.5-3B-14B | **60.4** | 13.5 | 27.2 | 60.2 | - | 73.4 | 20.9 | - |
+ | +Chat | 58.9 | **55.5** | 21.3 | 59.7 | 83.9 | 85.6 | 36.2 | 57.3 |
+ | **OLMoE (This Model)** | 49.8 | 3.0 | 33.6 | 22.4 | - | 59.7 | 16.6 | - |
+ | **+SFT** | 51.4 | 40.5 | 38.0 | 51.6 | 69.2 | 84.1 | 43.3 | 54.0 |
+ | **+DPO** | 51.9 | 45.5 | 37.0 | 54.8 | **84.0** | 82.6 | **48.1** | **57.7** |
+

# Citation

```bibtex