DeathReaper0965 committed
Commit • 5028a7b
1 Parent(s): a90bd52
Typos update in README.md
Adjusted some minor typos in the README file.

README.md CHANGED
@@ -80,7 +80,7 @@ license: apache-2.0
 Flan-UL2 is an encoder decoder model based on the `T5` architecture. It uses the same configuration as the [`UL2 model`](https://huggingface.co/google/ul2) released earlier last year. It was fine tuned using the "Flan" prompt tuning
 and dataset collection.
 
-According ot the original [blog](https://www.yitay.net/blog/flan-ul2-20b) here are the notable improvements:
+According to the original [blog](https://www.yitay.net/blog/flan-ul2-20b) here are the notable improvements:
 - The original UL2 model was only trained with receptive field of 512, which made it non-ideal for N-shot prompting where N is large.
 - The Flan-UL2 checkpoint uses a receptive field of 2048 which makes it more usable for few-shot in-context learning.
 - The original UL2 model also had mode switch tokens that was rather mandatory to get good performance. However, they were a little cumbersome as this requires often some changes during inference or finetuning. In this update/change, we continue training UL2 20B for an additional 100k steps (with small batch) to forget “mode tokens” before applying Flan instruction tuning. This Flan-UL2 checkpoint does not require mode tokens anymore.
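For context on the bullets above (2048-token receptive field, no mode tokens): a minimal prompting sketch, not part of this commit, assuming the checkpoint loads through the standard `transformers` seq2seq API as a `T5ForConditionalGeneration` model.

```python
# Illustrative sketch only (not from the README diff above): plain few-shot prompting,
# no mode tokens required, within the 2048-token receptive field.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-ul2", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "Answer the following question by reasoning step by step. "
    "The cafeteria had 23 apples. If they used 20 for lunch and bought 6 more, "
    "how many apples do they have?"
)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```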
@@ -89,7 +89,7 @@ According ot the original [blog](https://www.yitay.net/blog/flan-ul2-20b) here are the notable improvements:
 
 ## Converting from T5x to huggingface
 
-You can use the [`convert_t5x_checkpoint_to_pytorch.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/convert_t5x_checkpoint_to_pytorch.py) script and pass the argument `strict = False`. The final layer norm is missing from the original dictionnary, that is why we are passing the `
+You can use the [`convert_t5x_checkpoint_to_pytorch.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/convert_t5x_checkpoint_to_pytorch.py) script and pass the argument `strict = False`. The final layer norm is missing from the original dictionnary, that is why we are passing the `strict = False` argument.
 ```bash
 python convert_t5x_checkpoint_to_pytorch.py --t5x_checkpoint_path PATH_TO_T5X_CHECKPOINTS --config_file PATH_TO_CONFIG --pytorch_dump_path PATH_TO_SAVE
 ```
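An aside on the `strict = False` note above (not part of this commit): it refers to PyTorch's non-strict state-dict loading, since the converted dictionary is missing the final layer norm. A minimal sketch of that behaviour, using a small placeholder `T5Config` rather than the actual Flan-UL2 config:

```python
# Sketch only: demonstrates why strict=False is needed when a checkpoint dictionary
# is missing the final layer norm. Uses a small default T5Config as a stand-in.
from transformers import T5Config, T5ForConditionalGeneration

model = T5ForConditionalGeneration(T5Config())

state_dict = model.state_dict()
# Simulate the converted T5X dictionary, which lacks the final layer norm weight.
state_dict.pop("encoder.final_layer_norm.weight", None)

# strict=True would raise a RuntimeError about missing keys; strict=False loads
# the remaining weights and reports the missing keys instead.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(missing)  # ['encoder.final_layer_norm.weight']
```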
@@ -181,7 +181,7 @@ with a batch size of 1024. The sequence length is set to 512/512 for inputs and
 Dropout is set to 0 during pretraining. Pre-training took slightly more than one month for about 1 trillion
 tokens. The model has 32 encoder layers and 32 decoder layers, `dmodel` of 4096 and `df` of 16384.
 The dimension of each head is 256 for a total of 16 heads. Our model uses a model parallelism of 8.
-The same
+The same sentencepiece tokenizer as T5 of vocab size 32000 is used (click [here](https://huggingface.co/docs/transformers/v4.20.0/en/model_doc/t5#transformers.T5Tokenizer) for more information about the T5 tokenizer).
 
 UL-20B can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs.
 UL-20B was trained using the [Jax](https://github.com/google/jax) and [T5X](https://github.com/google-research/t5x) infrastructure.
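As a reading aid for the hyperparameters listed above (not part of this commit), here is how they would map onto a `transformers` `T5Config`; the values come from the text, and the shipped google/flan-ul2 config may differ in details such as the exact vocab size.

```python
# Sketch of the stated hyperparameters expressed as a T5Config; values come from the
# paragraph above, not from the shipped config file.
from transformers import T5Config

config = T5Config(
    vocab_size=32000,       # same sentencepiece vocabulary as T5
    d_model=4096,           # `dmodel`
    d_ff=16384,             # `df` (feed-forward width)
    d_kv=256,               # dimension of each attention head
    num_heads=16,           # 16 heads in total
    num_layers=32,          # encoder layers
    num_decoder_layers=32,  # decoder layers
    dropout_rate=0.0,       # dropout is 0 during pretraining
)
print(config)
```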