mayank-mishra committed on
Commit dfb7ef8
1 Parent(s): 6208e73

Update README.md

Files changed (1):
  1. README.md +2 -0
README.md CHANGED
@@ -120,8 +120,10 @@ model-index:
  
  ## Model Summary
  PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerMoE-3B shows promising results compared to other dense models with 2x the active parameters across various benchmarks, including natural language multiple-choice, code generation, and math reasoning.
+ Paper: https://arxiv.org/abs/2408.13359
  
  ## Usage
+ Note: requires a custom branch of transformers: https://github.com/mayank31398/transformers/tree/granitemoe
  
  ### Generation
  This is a simple example of how to use the **PowerMoE-3b** model.
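
For reference, a minimal sketch of what the Generation example could look like once the custom `granitemoe` branch of transformers is installed. The repo id `ibm/PowerMoE-3b`, the prompt, and the dtype/device choices are illustrative assumptions, not taken from this diff.

```python
# Minimal generation sketch, assuming the custom granitemoe branch is installed, e.g.:
#   pip install git+https://github.com/mayank31398/transformers.git@granitemoe
# and assuming the checkpoint is available under the repo id "ibm/PowerMoE-3b".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm/PowerMoE-3b"  # assumed repo id; substitute the actual checkpoint path
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and model; bfloat16 is an assumed dtype choice to reduce memory.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model.to(device).eval()

# Tokenize a prompt and generate a continuation.
prompt = "Write a function that computes the square of a number.\n"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```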