Commit a718c05 by LiyuanLucasLiu
1 Parent(s): 234a12a
link added
README.md CHANGED
@@ -22,9 +22,9 @@ library_name: transformers
 
 - With **only 6.6B** activated parameters, GRIN MoE achieves **exceptionally good** performance across a diverse set of tasks, particularly in coding and mathematics tasks.
 
-- GRIN uses **SparseMixer-v2** to estimate the gradient related to expert routing, while the conventional MoE training treats expert gating as a proxy for the gradient estimation.
+- GRIN uses [**SparseMixer-v2**](https://arxiv.org/html/2409.12136v1#Pt1) to estimate the gradient related to expert routing, while the conventional MoE training treats expert gating as a proxy for the gradient estimation.
 
-- GRIN scales MoE training with **neither expert parallelism nor token dropping**, while the conventional MoE training employs expert parallelism and deploys token dropping.
+- GRIN scales MoE training with [**neither expert parallelism nor token dropping**](https://arxiv.org/pdf/2409.12136#page=5.42), while the conventional MoE training employs expert parallelism and deploys token dropping.
 
 ## Intended Uses
 
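For context on the SparseMixer-v2 bullet in the hunk above: in conventional MoE training the discrete expert selection itself receives no gradient, so the gate probability is multiplied into the expert output and stands in as a proxy; SparseMixer-v2 instead estimates a gradient for the routing decision. The sketch below is a toy illustration of that contrast only. The class and parameter names (`ToyTop1MoE`, `mode`) are made up for this example, and the `"estimator"` branch uses a straight-through-style stand-in, not SparseMixer-v2's actual midpoint-based estimator or GRIN's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTop1MoE(nn.Module):
    """Toy top-1 MoE layer contrasting two ways to train the router."""

    def __init__(self, dim=16, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x, mode="proxy"):
        probs = F.softmax(self.router(x), dim=-1)          # (batch, E)
        idx = probs.argmax(dim=-1)                         # discrete routing decision
        onehot = F.one_hot(idx, probs.size(-1)).float()    # (batch, E)
        # Run every expert for clarity; real MoE layers dispatch tokens instead.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, dim)

        if mode == "proxy":
            # Conventional recipe: scale by the gate probability so the router
            # receives *a* gradient; the argmax itself contributes none, i.e.
            # the gating value acts as a proxy for the routing gradient.
            gate = (probs * onehot).sum(-1, keepdim=True)  # (batch, 1)
            return gate * (onehot.unsqueeze(-1) * expert_out).sum(dim=1)

        # "Estimator" mode: forward the hard decision, but let the backward
        # pass see the routing distribution. This straight-through-style trick
        # is a simplified stand-in for SparseMixer-v2, NOT the paper's method.
        mix = onehot + probs - probs.detach()              # value equals onehot
        return (mix.unsqueeze(-1) * expert_out).sum(dim=1)

x = torch.randn(8, 16)
layer = ToyTop1MoE()
layer(x, mode="proxy").sum().backward()  # router grads arrive only via the gate proxy
```

Running the same backward pass with `mode="estimator"` sends gradient to all router logits through `probs`, which is the behavior the README's "estimate the gradient related to expert routing" phrasing points at; the linked paper describes the actual estimator.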