ritikk committed on
Commit dd9460b
1 Parent(s): 04dc1b0

Upload folder using huggingface_hub

Files changed (2)
  1. .ipynb_checkpoints/README-checkpoint.md +38 -20
  2. README.md +48 -29
.ipynb_checkpoints/README-checkpoint.md CHANGED
@@ -1,20 +1,40 @@
1
  ---
2
- license: apache-2.0
3
  language:
4
  - en
5
  pipeline_tag: text-generation
6
  inference: false
7
  tags:
8
- - mistral
9
- - pytorch
10
  - inferentia2
11
  - neuron
12
  ---
13
- # Neuronx model for Mistral
14
 
15
- This repository contains [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) and [`neuronx`](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) compatible checkpoints for [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1).
16
 
17
- However, this file includes an example of how to compile various versions of Mistral. Support isn’t available yet (as of 1/3/2024) in the optimum-neuron framework, so we use the base transformers library.
18
 
19
  These instructions closely follow the [Developer Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html#grouped-query-attention-gqa-support-beta). Look there for more detailed explanations, especially for the GQA settings.
20
 
@@ -32,7 +52,7 @@ python -m pip install git+https://github.com/aws-neuron/transformers-neuronx.git
32
 
33
  ## Running inference from this repository
34
 
35
- If you want to run a quick test or if the exact model you want to use is [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1), you can run it directly using the steps below. Otherwise, jump to the Compilation of other Mistral versions section.
36
 
37
  First, you will need a local copy of the model repository. Normally the Hugging Face optimum library abstracts away the difference between loading from a local directory and loading from the Hub, but it does not support Mistral inference yet.
38
 
@@ -41,11 +61,11 @@ From python:
41
  ```
42
  # using python instead of git clone because I know this supports lfs on the DLAMI image
43
  from huggingface_hub import Repository
44
- repo = Repository(local_dir="Mistral-neuron", clone_from="aws-neuron/Mistral-neuron")
45
 
46
  ```
47
 
48
- This should put a local copy in Mistral-neuron. This process should take a 5-10 minutes. If it completes in a few seconds the first time you run it, you are having problems with git-lfs. You can see this by using ls -al to check the size of the files downloaded. You will also notice it later when you get parsing errors.
49
 
50
  Next, load the model and neff files from disk into the Neuron processors:
51
 
@@ -53,7 +73,6 @@ Next, load the model and neff files from disk into the Neuron processors:
53
  import torch
54
  from transformers_neuronx import constants
55
  from transformers_neuronx.mistral.model import MistralForSampling
56
- from transformers_neuronx.module import save_pretrained_split
57
  from transformers_neuronx.config import NeuronConfig
58
  from transformers import AutoModelForCausalLM, AutoTokenizer
59
 
@@ -61,19 +80,20 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
61
  neuron_config = NeuronConfig(
62
  grouped_query_attention=constants.GQA.SHARD_OVER_HEADS
63
  )
 
64
  # define the model. These are the settings used in compilation.
65
  # If you want to change these settings, skip to "Compilation of other Mistral versions"
66
- model_neuron = MistralForSampling.from_pretrained("Mistral-neuron", batch_size=1, tp_degree=2, n_positions=256, amp='bf16', neuron_config=neuron_config)
 
67
 
68
  # load the neff files from the local directory instead of compiling
69
- model_neuron.load("Mistral-neuron")
70
 
71
  # load the neff files into the neuron processors.
72
  # you can see this process happening if you run neuron-top from the command line in another console.
73
  # if you didn't do the previous load command, this will also compile the neff files
74
  model_neuron.to_neuron()
75
 
76
-
77
  ```
78
 
79
  ## Inference example
@@ -82,8 +102,10 @@ This points to the original model for the tokenizer because the tokenizer is the
82
  If you are compiling your own and want to have a single reference for everything, you can copy the special_tokens_map.json and tokenizer* from the original model to your local copy.
83
 
84
  ```
85
- # Get a tokenizer and example input. Note that this points to the original model
86
- tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.1')
 
 
87
  text = "[INST] What is your favourite condiment? [/INST]"
88
  encoded_input = tokenizer(text, return_tensors='pt')
89
 
@@ -96,13 +118,9 @@ with torch.inference_mode():
96
 
97
 
98
  Example output:
99
- (most of the time with amp=‘bf16’, the answer is ketchup. However, if I compiled with amp=f32, the answer was soy sauce. This was for a sample size of one, so let me know what you see —@jburtoft)
100
 
101
  ```
102
- 2024-Jan-03 15:59:21.0510 1486:2057 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
103
- 2024-Jan-03 15:59:21.0510 1486:2057 [0] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
104
- ['<s> [INST] What is your favourite condiment? [/INST] My favorite condiment is probably ketchup. It adds a perfect balance of sweet, tangy, and slightly spicy flavor to dishes, and is versatile enough to go with a wide variety of foods.</s>']
105
-
106
  ```
107
 
108
  ## Compilation of other Mistral versions
 
1
  ---
2
+ base_model: HuggingFaceH4/zephyr-7b-beta
3
+ datasets:
4
+ - HuggingFaceH4/ultrachat_200k
5
+ - HuggingFaceH4/ultrafeedback_binarized
6
+ license: mit
7
  language:
8
  - en
9
  pipeline_tag: text-generation
10
  inference: false
11
  tags:
12
+ - generated_from_trainer
 
13
  - inferentia2
14
  - neuron
15
+ model-index:
16
+ - name: zephyr-7b-beta
17
+ results: []
18
+ model_creator: Hugging Face H4
19
+ model_name: Zephyr 7B Beta
20
+ model_type: mistral
21
+ prompt_template: '<|system|>
22
+
23
+ </s>
24
+
25
+ <|user|>
26
+
27
+ {prompt}</s>
28
+
29
+ <|assistant|>
30
+
31
+ '
32
  ---
33
+ # Neuronx model for Zephyr-7b-beta
34
 
35
+ This repository contains [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) and [`neuronx`](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) compatible checkpoints for [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta).
36
 
37
+ This file also shows how to compile other versions of Zephyr. Support isn't available yet (as of 1/9/2024) in the optimum-neuron framework, so we use the base transformers and transformers-neuronx libraries instead.
38
 
39
  These instructions closely follow the [Developer Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html#grouped-query-attention-gqa-support-beta). Look there for more detailed explanations, especially for the GQA settings.
40
 
 
52
 
53
  ## Running inference from this repository
54
 
55
+ If you want to run a quick test or if the exact model you want to use is [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), you can run it directly using the steps below. Otherwise, jump to the Compilation of other Mistral versions section.
56
 
57
  First, you will need a local copy of the model repository. Normally the Hugging Face optimum library abstracts away the difference between loading from a local directory and loading from the Hub, but it does not support Mistral inference yet.
58
 
 
61
  ```
62
  # using python instead of git clone because I know this supports lfs on the DLAMI image
63
  from huggingface_hub import Repository
64
+ repo = Repository(local_dir="zephyr-7b-beta-neuron", clone_from="ritikk/zephyr-7b-beta-neuron")
65
 
66
  ```
67
 
68
+ This should put a local copy in zephyr-7b-beta-neuron. The process should take 5-10 minutes. If it completes in a few seconds the first time you run it, you are likely having problems with git-lfs. You can check by running `ls -al` and looking at the sizes of the downloaded files; you will also notice it later when you get parsing errors.
69
 
70
  Next, load the model and neff files from disk into the Neuron processors:
71
 
 
73
  import torch
74
  from transformers_neuronx import constants
75
  from transformers_neuronx.mistral.model import MistralForSampling
 
76
  from transformers_neuronx.config import NeuronConfig
77
  from transformers import AutoModelForCausalLM, AutoTokenizer
78
 
 
80
  neuron_config = NeuronConfig(
81
  grouped_query_attention=constants.GQA.SHARD_OVER_HEADS
82
  )
83
+
84
  # define the model. These are the settings used in compilation.
85
  # If you want to change these settings, skip to "Compilation of other Mistral versions"
86
+ model_neuron = MistralForSampling.from_pretrained("zephyr-7b-beta-neuron", batch_size=1, \
87
+ tp_degree=2, n_positions=256, amp='bf16', neuron_config=neuron_config)
88
 
89
  # load the neff files from the local directory instead of compiling
90
+ model_neuron.load("zephyr-7b-beta-neuron")
91
 
92
  # load the neff files into the neuron processors.
93
  # you can see this process happening if you run neuron-top from the command line in another console.
94
  # if you didn't do the previous load command, this will also compile the neff files
95
  model_neuron.to_neuron()
96
 
 
97
  ```
98
 
99
  ## Inference example
 
102
  If you are compiling your own and want to have a single reference for everything, you can copy the special_tokens_map.json and tokenizer* from the original model to your local copy.
103
 
104
  ```
105
+ # Get a tokenizer and example input. The commented line below points to the original tokenizer.
106
+ # tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
107
+ # This loads the tokenizer from the local copy instead.
108
+ tokenizer = AutoTokenizer.from_pretrained("zephyr-7b-beta-neuron")
109
  text = "[INST] What is your favourite condiment? [/INST]"
110
  encoded_input = tokenizer(text, return_tensors='pt')
111
 
 
118
 
119
 
120
  Example output:
 
121
 
122
  ```
123
+ ["<s> [INST] What is your favourite condiment? [/INST]\nHere's a little script to test people's favorite condiment.\n\nYou can do this with paper cones and have people guess what's in it, but they need to write their guess on a piece of of paper and put it in a jar before they take a bite.\n\nIn this version, we have ketchup, mustard,mayonnaise,bbq sauce, and relish.\n\nThe script is straightforward, so as long as your bottle isn’t too tiny, you can add to the bottom of the script,or re-shape the form of the script a bit.\n\nIf you put their guesses in a jar before they take a bite,you can put all their guesses in the jar as soon as they're done,and show the container as they guess.\nAs for removing lines from the script,you'll probably be removing the ones from the bottom of the script,or adding lines to the top of of the script.\nIf for no matter reason your bottle is too tiny to set all the guesses in,you can write their guesses on cards or bits of paper,and set"]
 
 
 
124
  ```
125
 
126
  ## Compilation of other Mistral versions
README.md CHANGED
@@ -1,20 +1,40 @@
1
  ---
2
- license: apache-2.0
3
  language:
4
  - en
5
  pipeline_tag: text-generation
6
  inference: false
7
  tags:
8
- - mistral
9
- - pytorch
10
  - inferentia2
11
  - neuron
12
  ---
13
- # Neuronx model for Mistral
14
 
15
- This repository contains [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) and [`neuronx`](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) compatible checkpoints for [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1).
16
 
17
- However, this file includes an example of how to compile various versions of Mistral. Support isn’t available yet (as of 1/3/2024) in the optimum-neuron framework, so we use the base transformers library.
18
 
19
  These instructions closely follow the [Developer Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html#grouped-query-attention-gqa-support-beta). Look there for more detailed explanations, especially for the GQA settings.
20
 
@@ -32,7 +52,7 @@ python -m pip install git+https://github.com/aws-neuron/transformers-neuronx.git
32
 
33
  ## Running inference from this repository
34
 
35
- If you want to run a quick test or if the exact model you want to use is [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1), you can run it directly using the steps below. Otherwise, jump to the Compilation of other Mistral versions section.
36
 
37
  First, you will need a local copy of the library. This is because one of the nice things that the Hugging Face optimum library does is abstract local loads from repository loads. However, Mistral inference isn't supported yet.
38
 
@@ -41,11 +61,11 @@ From python:
41
  ```
42
  # using python instead of git clone because I know this supports lfs on the DLAMI image
43
  from huggingface_hub import Repository
44
- repo = Repository(local_dir="Mistral-neuron", clone_from="aws-neuron/Mistral-neuron")
45
 
46
  ```
47
 
48
- This should put a local copy in Mistral-neuron. This process should take a 5-10 minutes. If it completes in a few seconds the first time you run it, you are having problems with git-lfs. You can see this by using ls -al to check the size of the files downloaded. You will also notice it later when you get parsing errors.
49
 
50
  Next, load the model and neff files from disk into the Neuron processors:
51
 
@@ -53,7 +73,6 @@ Next, load the model and neff files from disk into the Neuron processors:
53
  import torch
54
  from transformers_neuronx import constants
55
  from transformers_neuronx.mistral.model import MistralForSampling
56
- from transformers_neuronx.module import save_pretrained_split
57
  from transformers_neuronx.config import NeuronConfig
58
  from transformers import AutoModelForCausalLM, AutoTokenizer
59
 
@@ -61,19 +80,20 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
61
  neuron_config = NeuronConfig(
62
  grouped_query_attention=constants.GQA.SHARD_OVER_HEADS
63
  )
 
64
  # define the model. These are the settings used in compilation.
65
  # If you want to change these settings, skip to "Compilation of other Mistral versions"
66
- model_neuron = MistralForSampling.from_pretrained("Mistral-neuron", batch_size=1, tp_degree=2, n_positions=256, amp='bf16', neuron_config=neuron_config)
 
67
 
68
  # load the neff files from the local directory instead of compiling
69
- model_neuron.load("Mistral-neuron")
70
 
71
  # load the neff files into the neuron processors.
72
  # you can see this process happening if you run neuron-top from the command line in another console.
73
  # if you didn't do the previous load command, this will also compile the neff files
74
  model_neuron.to_neuron()
75
 
76
-
77
  ```
78
 
79
  ## Inference example
@@ -82,8 +102,10 @@ This points to the original model for the tokenizer because the tokenizer is the
82
  If you are compiling your own and want to have a single reference for everything, you can copy the special_tokens_map.json and tokenizer* from the original model to your local copy.
83
 
84
  ```
85
- # Get a tokenizer and example input. Note that this points to the original model
86
- tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.1')
 
 
87
  text = "[INST] What is your favourite condiment? [/INST]"
88
  encoded_input = tokenizer(text, return_tensors='pt')
89
 
@@ -96,18 +118,14 @@ with torch.inference_mode():
96
 
97
 
98
  Example output:
99
- (most of the time with amp=‘bf16’, the answer is ketchup. However, if I compiled with amp=f32, the answer was soy sauce. This was for a sample size of one, so let me know what you see —@jburtoft)
100
 
101
  ```
102
- 2024-Jan-03 15:59:21.0510 1486:2057 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
103
- 2024-Jan-03 15:59:21.0510 1486:2057 [0] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
104
- ['<s> [INST] What is your favourite condiment? [/INST] My favorite condiment is probably ketchup. It adds a perfect balance of sweet, tangy, and slightly spicy flavor to dishes, and is versatile enough to go with a wide variety of foods.</s>']
105
-
106
  ```
107
 
108
  ## Compilation of other Mistral versions
109
 
110
- If you want to use a different version of Mistral from Hugging Face, use the slightly modified code below. It essentially removes the “load” command. When the “to_neuron()” command sees that the model object doesn’t include the neff files, it will kick off the recompile. You can save them at the end so you only have to do the compilation process once. After that, you can use the code above to load a model and the neff files from the local directory.
111
 
112
  ```
113
  import torch
@@ -117,11 +135,13 @@ from transformers_neuronx.module import save_pretrained_split
117
  from transformers_neuronx.config import NeuronConfig
118
  from transformers import AutoModelForCausalLM, AutoTokenizer
119
 
 
 
120
  # Load and save the CPU model with bfloat16 casting. This also gives us a local copy
121
- # change the Hugging Face model name (mistralai/Mistral-7B-Instruct-v0.1) below to what you want
122
  # You can update the other model names if you want, but they just reference a directory on the local disk.
123
- model_cpu = AutoModelForCausalLM.from_pretrained('mistralai/Mistral-7B-Instruct-v0.1')
124
- save_pretrained_split(model_cpu, 'mistralai/Mistral-7B-Instruct-v0.1-split')
125
 
126
  # Set sharding strategy for GQA to be shard over heads
127
  neuron_config = NeuronConfig(
@@ -129,13 +149,12 @@ neuron_config = NeuronConfig(
129
  )
130
 
131
  # Create and compile the Neuron model
132
- model_neuron = MistralForSampling.from_pretrained('mistralai/Mistral-7B-Instruct-v0.1-split', batch_size=1, \
133
  tp_degree=2, n_positions=256, amp='bf16', neuron_config=neuron_config)
134
  model_neuron.to_neuron()
135
 
136
  #save compiled neff files out to the same directory
137
- model_neuron.save("mistralai/Mistral-7B-Instruct-v0.1-split")
138
-
139
 
140
  ```
141
 
@@ -143,13 +162,13 @@ model_neuron.save("mistralai/Mistral-7B-Instruct-v0.1-split")
143
 
144
  ## Arguments passed during compilation
145
 
146
- The settings use in compilation are the same as shown above in the code. If you want to change these, you will need to recompile. If you don’t want to pass them in each time, you could update the config.json file. This is another nice thing the Hugging Face optimum framework does for us. You can see an example of the format by looking at one of the Llama model config.json files. For [example](https://huggingface.co/aws-neuron/Llama-2-7b-hf-neuron-latency/blob/main/config.json).
147
 
148
  ```
149
  neuron_config = NeuronConfig(
150
  grouped_query_attention=constants.GQA.SHARD_OVER_HEADS
151
  )
152
- ("Mistral-neuron", batch_size=1, tp_degree=2, n_positions=256, amp='bf16', neuron_config=neuron_config)
153
 
154
  ```
155
 
 
1
  ---
2
+ base_model: HuggingFaceH4/zephyr-7b-beta
3
+ datasets:
4
+ - HuggingFaceH4/ultrachat_200k
5
+ - HuggingFaceH4/ultrafeedback_binarized
6
+ license: mit
7
  language:
8
  - en
9
  pipeline_tag: text-generation
10
  inference: false
11
  tags:
12
+ - generated_from_trainer
 
13
  - inferentia2
14
  - neuron
15
+ model-index:
16
+ - name: zephyr-7b-beta
17
+ results: []
18
+ model_creator: Hugging Face H4
19
+ model_name: Zephyr 7B Beta
20
+ model_type: mistral
21
+ prompt_template: '<|system|>
22
+
23
+ </s>
24
+
25
+ <|user|>
26
+
27
+ {prompt}</s>
28
+
29
+ <|assistant|>
30
+
31
+ '
32
  ---
33
+ # Neuronx model for Zephyr-7b-beta
34
 
35
+ This repository contains [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) and [`neuronx`](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) compatible checkpoints for [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta).
36
 
37
+ This file also shows how to compile other versions of Zephyr. Support isn't available yet (as of 1/9/2024) in the optimum-neuron framework, so we use the base transformers and transformers-neuronx libraries instead.
38
 
39
  These instructions closely follow the [Developer Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html#grouped-query-attention-gqa-support-beta). Look there for more detailed explanations, especially for the GQA settings.
40
 
 
52
 
53
  ## Running inference from this repository
54
 
55
+ If you want to run a quick test or if the exact model you want to use is [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), you can run it directly using the steps below. Otherwise, jump to the Compilation of other Mistral versions section.
56
 
57
  First, you will need a local copy of the model repository. Normally the Hugging Face optimum library abstracts away the difference between loading from a local directory and loading from the Hub, but it does not support Mistral inference yet.
58
 
 
61
  ```
62
  # using python instead of git clone because I know this supports lfs on the DLAMI image
63
  from huggingface_hub import Repository
64
+ repo = Repository(local_dir="zephyr-7b-beta-neuron", clone_from="ritikk/zephyr-7b-beta-neuron")
65
 
66
  ```
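If your image has a recent version of huggingface_hub, `snapshot_download` is an alternative to `Repository` that fetches the LFS-backed files over HTTP; a minimal sketch, reusing the repository and directory names above:

```
# alternative download path (illustrative): snapshot_download pulls the
# LFS-backed weight files itself, so no git-lfs setup is needed
from huggingface_hub import snapshot_download

snapshot_download(repo_id="ritikk/zephyr-7b-beta-neuron", local_dir="zephyr-7b-beta-neuron")
```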
67
 
68
+ This should put a local copy in zephyr-7b-beta-neuron. The process should take 5-10 minutes. If it completes in a few seconds the first time you run it, you are likely having problems with git-lfs. You can check by running `ls -al` and looking at the sizes of the downloaded files; you will also notice it later when you get parsing errors.
69
 
70
  Next, load the model and neff files from disk into the Neuron processors:
71
 
 
73
  import torch
74
  from transformers_neuronx import constants
75
  from transformers_neuronx.mistral.model import MistralForSampling
 
76
  from transformers_neuronx.config import NeuronConfig
77
  from transformers import AutoModelForCausalLM, AutoTokenizer
78
 
 
80
  neuron_config = NeuronConfig(
81
  grouped_query_attention=constants.GQA.SHARD_OVER_HEADS
82
  )
83
+
84
  # define the model. These are the settings used in compilation.
85
  # If you want to change these settings, skip to "Compilation of other Mistral versions"
86
+ model_neuron = MistralForSampling.from_pretrained("zephyr-7b-beta-neuron", batch_size=1, \
87
+ tp_degree=2, n_positions=256, amp='bf16', neuron_config=neuron_config)
88
 
89
  # load the neff files from the local directory instead of compiling
90
+ model_neuron.load("zephyr-7b-beta-neuron")
91
 
92
  # load the neff files into the neuron processors.
93
  # you can see this process happening if you run neuron-top from the command line in another console.
94
  # if you didn't do the previous load command, this will also compile the neff files
95
  model_neuron.to_neuron()
96
 
 
97
  ```
98
 
99
  ## Inference example
 
102
  If you are compiling your own and want to have a single reference for everything, you can copy the special_tokens_map.json and tokenizer* from the original model to your local copy.
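A minimal sketch of that copy step, assuming the directory names used above (the exact set of tokenizer files can vary by model):

```
# illustrative: pull the tokenizer files from the original model and copy them
# next to the compiled weights so everything loads from one directory
import shutil
from huggingface_hub import hf_hub_download

for fname in ("tokenizer.json", "tokenizer_config.json", "special_tokens_map.json"):
    shutil.copy(hf_hub_download("HuggingFaceH4/zephyr-7b-beta", fname), "zephyr-7b-beta-neuron/")
```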
103
 
104
  ```
105
+ # Get a tokenizer and example input. The commented line below points to the original tokenizer.
106
+ # tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
107
+ # This loads the tokenizer from the local copy instead.
108
+ tokenizer = AutoTokenizer.from_pretrained("zephyr-7b-beta-neuron")
109
  text = "[INST] What is your favourite condiment? [/INST]"
110
  encoded_input = tokenizer(text, return_tensors='pt')
111
 
 
118
 
119
 
120
  Example output:
 
121
 
122
  ```
123
+ ["<s> [INST] What is your favourite condiment? [/INST]\nHere's a little script to test people's favorite condiment.\n\nYou can do this with paper cones and have people guess what's in it, but they need to write their guess on a piece of of paper and put it in a jar before they take a bite.\n\nIn this version, we have ketchup, mustard,mayonnaise,bbq sauce, and relish.\n\nThe script is straightforward, so as long as your bottle isn’t too tiny, you can add to the bottom of the script,or re-shape the form of the script a bit.\n\nIf you put their guesses in a jar before they take a bite,you can put all their guesses in the jar as soon as they're done,and show the container as they guess.\nAs for removing lines from the script,you'll probably be removing the ones from the bottom of the script,or adding lines to the top of of the script.\nIf for no matter reason your bottle is too tiny to set all the guesses in,you can write their guesses on cards or bits of paper,and set"]
 
 
 
124
  ```
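For reference, the sampling call that produces output like the above can look roughly like the sketch below, assuming the `model_neuron`, `tokenizer`, and `encoded_input` objects from the earlier snippets; the `sequence_length` value is illustrative and should not exceed the `n_positions` used at compile time:

```
# illustrative sampling call: generate up to 256 tokens on the Neuron cores
with torch.inference_mode():
    generated_sequences = model_neuron.sample(encoded_input.input_ids, sequence_length=256)

print([tokenizer.decode(seq) for seq in generated_sequences])
```

Also note that the `prompt_template` in the card metadata uses Zephyr's `<|system|>`/`<|user|>`/`<|assistant|>` format rather than Mistral's `[INST]` tags, so results may improve if you build the prompt with `tokenizer.apply_chat_template` (available in recent transformers releases).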
125
 
126
  ## Compilation of other Mistral versions
127
 
128
+ If you want to use a different version of Mistral or Zephyr from Hugging Face, use the slightly modified code below. It essentially removes the “load” command. When the “to_neuron()” command sees that the model object doesn’t include the neff files, it will kick off the recompile. You can save them at the end so you only have to do the compilation process once. After that, you can use the code above to load a model and the neff files from the local directory.
129
 
130
  ```
131
  import torch
 
135
  from transformers_neuronx.config import NeuronConfig
136
  from transformers import AutoModelForCausalLM, AutoTokenizer
137
 
138
+ model_id="HuggingFaceH4/zephyr-7b-beta"
139
+
140
  # Load and save the CPU model with bfloat16 casting. This also gives us a local copy
141
+ # change model_id above (HuggingFaceH4/zephyr-7b-beta) to the Hugging Face model you want to compile
142
  # You can update the other model names if you want, but they just reference a directory on the local disk.
143
+ model_cpu = AutoModelForCausalLM.from_pretrained(model_id)
144
+ save_pretrained_split(model_cpu, model_id)
145
 
146
  # Set sharding strategy for GQA to be shard over heads
147
  neuron_config = NeuronConfig(
 
149
  )
150
 
151
  # Create and compile the Neuron model
152
+ model_neuron = MistralForSampling.from_pretrained(model_id, batch_size=1, \
153
  tp_degree=2, n_positions=256, amp='bf16', neuron_config=neuron_config)
154
  model_neuron.to_neuron()
155
 
156
  #save compiled neff files out to the same directory
157
+ model_neuron.save("HuggingFaceH4/zephyr-7b-beta")
 
158
 
159
  ```
160
 
 
162
 
163
  ## Arguments passed during compilation
164
 
165
+ The settings used in compilation are the same as shown above in the code. If you want to change them, you will need to recompile. If you don't want to pass them in each time, you can update the config.json file; this is another nice thing the Hugging Face optimum-neuron framework does for us. You can see the format in one of the Llama model config.json files, for [example](https://huggingface.co/aws-neuron/Llama-2-7b-hf-neuron-latency/blob/main/config.json).
166
 
167
  ```
168
  neuron_config = NeuronConfig(
169
  grouped_query_attention=constants.GQA.SHARD_OVER_HEADS
170
  )
171
+ ("zephyr-7b-beta-neuron", batch_size=1, tp_degree=2, n_positions=256, amp='bf16', neuron_config=neuron_config)
172
 
173
  ```
174