mayank-mishra committed
Commit dfb7ef8 • 1 Parent(s): 6208e73
Update README.md

README.md CHANGED
@@ -120,8 +120,10 @@ model-index:
## Model Summary
PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerMoE-3B has shown promising results compared to other dense models with 2x active parameters across various benchmarks, including natural language multiple-choice, code generation, and math reasoning.
+Paper: https://arxiv.org/abs/2408.13359

## Usage
+Note: requires a custom branch of transformers: https://github.com/mayank31398/transformers/tree/granitemoe

### Generation
This is a simple example of how to use the **PowerMoE-3b** model.
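The generation snippet itself is not part of this diff, so here is a minimal sketch of what loading and generating with the model could look like, assuming the granitemoe branch noted above has been installed (e.g. `pip install git+https://github.com/mayank31398/transformers.git@granitemoe`) and that the checkpoint is available on the Hub under `ibm/PowerMoE-3b`; the repo id and the prompt are assumptions for illustration, not taken from this commit.

```python
# Minimal generation sketch. Assumes the custom transformers branch is installed:
#   pip install git+https://github.com/mayank31398/transformers.git@granitemoe
# The checkpoint id below is an assumption; replace it with the actual Hub repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm/PowerMoE-3b"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model.eval()

# Tokenize a prompt and generate a short continuation.
prompt = "Write a Python function that returns the square of a number."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The usual `generate` options (sampling, temperature, top-p, and so on) can be passed as keyword arguments in the same way as for any other causal language model in transformers.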