sami-t kornfield committed on
Commit
c5e05a7
1 Parent(s): 28c9cf6

Update to Latest Mosaic Version (#2)

- Fast forward to latest mosaic version (d18b7c70b2b270ec7055a5517448291e8f8d65b2)


Co-authored-by: K <[email protected]>

README.md CHANGED
@@ -14,23 +14,17 @@ datasets:
14
  inference: false
15
  ---
16
 
17
- ### Attribution
18
-
19
- This model is derived from [MosaicML's MPT-7B model](https://huggingface.co/mosaicml/mpt-7b/tree/main), with changes from
20
- [cekal/mpt-7b-peft-compatible](https://huggingface.co/cekal/mpt-7b-peft-compatible) applied; each licensed under the
21
- Apache License, version 2.0.
22
-
23
  # MPT-7B
24
 
25
  MPT-7B is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code.
26
  This model was trained by [MosaicML](https://www.mosaicml.com).
27
 
28
- MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference.
29
 
30
- These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing
31
- positional embeddings with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)).
32
- Thanks to these modifications, MPT models can be trained with high throughput efficiency and stable convergence.
33
- MPT models can also be served efficiently with both standard HuggingFace pipelines and NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer).
34
 
35
  This model uses the MosaicML LLM codebase, which can be found in the [llm-foundry repository](https://github.com/mosaicml/llm-foundry). It was trained by MosaicML’s NLP team on the [MosaicML platform](https://www.mosaicml.com/training) for LLM pretraining, finetuning, and inference.
36
 
@@ -55,15 +49,13 @@ We demonstrate generations as long as 80k tokens on a single A100-80GB GPU in ou
55
  * License: Apache 2.0
56
 
57
  * [MPT-7B-Instruct](https://huggingface.co/mosaicml/mpt-7b-instruct): a model for short-form instruction following.
58
- Built by finetuning MPT-7B on a [dataset](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) we also release, derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.
59
- * License: _CC-By-SA-3.0_
60
- * [Demo on Hugging Face Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-instruct)
61
 
62
  * [MPT-7B-Chat](https://huggingface.co/mosaicml/mpt-7b-chat): a chatbot-like model for dialogue generation.
63
  Built by finetuning MPT-7B on the [ShareGPT-Vicuna](https://huggingface.co/datasets/jeffwan/sharegpt_vicuna), [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3),
64
  [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), and [Evol-Instruct](https://huggingface.co/datasets/victor123/evol_instruct_70k) datasets.
65
  * License: _CC-By-NC-SA-4.0_
66
- * [Demo on Hugging Face Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-chat)
67
 
68
  ## Model Date
69
 
@@ -77,7 +69,7 @@ Apache-2.0
77
 
78
  * [Blog post: Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs](https://www.mosaicml.com/blog/mpt-7b)
79
  * [Codebase (mosaicml/llm-foundry repo)](https://github.com/mosaicml/llm-foundry/)
80
- * Questions: Feel free to contact us via the [MosaicML Community Slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-1btms90mc-GipE2ufuPkKY0QBrmF3LSA)!
81
 
82
 
83
  ## How to Use
@@ -91,37 +83,41 @@ model = transformers.AutoModelForCausalLM.from_pretrained(
91
  trust_remote_code=True
92
  )
93
  ```
94
- Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
95
  This is because we use a custom `MPT` model architecture that is not yet part of the Hugging Face `transformers` package.
96
  `MPT` includes options for many training efficiency features such as [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), [QK LayerNorm](https://arxiv.org/abs/2010.04245), and more.
97
 
98
- To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention, you can load the model with `attn_impl='triton'` and move the model to `bfloat16`:
99
  ```python
100
- config = transformers.AutoConfig.from_pretrained(
101
- 'mosaicml/mpt-7b',
102
- trust_remote_code=True
103
- )
 
 
104
  config.attn_config['attn_impl'] = 'triton'
 
105
 
106
  model = transformers.AutoModelForCausalLM.from_pretrained(
107
- 'mosaicml/mpt-7b',
108
  config=config,
109
- torch_dtype=torch.bfloat16,
110
  trust_remote_code=True
111
  )
112
- model.to(device='cuda:0')
113
  ```
114
 
115
  Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. For example:
116
 
117
  ```python
118
- config = transformers.AutoConfig.from_pretrained(
119
- 'mosaicml/mpt-7b',
120
- trust_remote_code=True
121
- )
122
- config.update({"max_seq_len": 4096})
 
 
123
  model = transformers.AutoModelForCausalLM.from_pretrained(
124
- 'mosaicml/mpt-7b',
125
  config=config,
126
  trust_remote_code=True
127
  )
@@ -131,7 +127,23 @@ This model was trained with the [EleutherAI/gpt-neox-20b](https://huggingface.co
131
 
132
  ```python
133
  from transformers import AutoTokenizer
134
- tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
135
  ```
136
 
137
  ## Model Description
@@ -159,7 +171,7 @@ The model has been modified from a standard transformer in the following ways:
159
 
160
  ### Streaming Datasets
161
 
162
- Data was formatted using the MosaicML [StreamingDataset](https://github.com/mosaicml/streaming) library to host our data in object storage and efficiently stream it to our compute cluster during training.
163
  StreamingDataset obviates the need to download the whole dataset before starting training, and allows instant resumption of training from any point in the dataset.
164
 
165
 
@@ -184,24 +196,24 @@ The model was trained for 1T tokens (with batch size 1760 and sequence length 20
184
  Samples for each batch were selected from one of the datasets with the probability specified above.
185
  The examples were shuffled within each dataset, and each example was constructed from as many sequences from that dataset as were necessary to fill the 2048 sequence length.
186
 
187
- The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer. This BPE tokenizer has a number of desirable characteristics,
188
- most of which are relevant for tokenizing code:
189
- (1) It was trained on a diverse mix of data that includes code (The Pile)
190
- (2) It applies consistent space delimitation, unlike the GPT2 tokenizer which tokenizes inconsistently depending on the presence of prefix spaces
191
- (3) It contains tokens for repeated space characters, which allows superior compression of text with large amounts of repeated space characters.
192
 
193
  The model vocabulary size of 50432 was set to be a multiple of 128 (as in [MEGATRON-LM](https://arxiv.org/abs/1909.08053)), model flop utilization (MFU) increased by up to four percentage points.
194
 
195
  ### Training Configuration
196
 
197
- This model was trained on 440 A100-40GBs for about 9.5 days using the [MosaicML Platform](https://www.mosaicml.com/platform).
198
- The model was trained with sharded data parallelism using [FSDP](https://pytorch.org/docs/stable/fsdp.html) and used the [LION](https://arxiv.org/abs/2302.06675) optimizer.
199
 
200
  ## Limitations and Biases
201
 
202
  _The following language is modified from [EleutherAI's GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b)_
203
 
204
- MPT-7B (Base) is **not** intended for deployment without finetuning.
205
  It should not be used for human-facing interactions without further guardrails and user consent.
206
 
207
  MPT-7B can produce factually incorrect output, and should not be relied on to produce factually accurate information.
@@ -224,11 +236,11 @@ Please cite this model using the following format:
224
  ```
225
  @online{MosaicML2023Introducing,
226
  author = {MosaicML NLP Team},
227
- title = {Introducing MPT-7B: A New Standard for Open-Source,
228
- ly Usable LLMs},
229
  year = {2023},
230
  url = {www.mosaicml.com/blog/mpt-7b},
231
- note = {Accessed: 2023-03-28}, % change this date
232
- urldate = {2023-03-28} % change this date
233
  }
234
  ```
 
14
  inference: false
15
  ---
16
 
 
 
 
 
 
 
17
  # MPT-7B
18
 
19
  MPT-7B is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code.
20
  This model was trained by [MosaicML](https://www.mosaicml.com).
21
 
22
+ MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference.
23
 
24
+ These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing
25
+ positional embeddings with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)).
26
+ Thanks to these modifications, MPT models can be trained with high throughput efficiency and stable convergence.
27
+ MPT models can also be served efficiently with both standard HuggingFace pipelines and NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer).
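
To make the ALiBi idea above concrete, here is a minimal pure-Python sketch. It is not the code from this repository: it assumes a power-of-two head count, reuses the default `alibi_bias_max=8` that appears in `attention.py`, and the helper names `alibi_slopes` and `alibi_bias` are hypothetical. Each head gets a fixed slope, and the bias added to the attention logits grows more negative with distance from the most recent key, so no learned positional embedding is needed.

```python
from typing import List


def alibi_slopes(n_heads: int, alibi_bias_max: int = 8) -> List[float]:
    """Per-head slopes 2^(-alibi_bias_max * i / n_heads) for i = 1..n_heads.

    Assumption: n_heads is a power of two (other head counts are not handled here).
    """
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-alibi_bias_max * (i + 1) / n_heads) for i in range(n_heads)]


def alibi_bias(n_heads: int, seq_len: int, alibi_bias_max: int = 8) -> List[List[float]]:
    """Bias row per head for a single query attending to seq_len keys.

    Relative positions run from -(seq_len - 1) for the oldest key to 0 for
    the most recent key, and each head scales them by its own slope.
    """
    distances = list(range(1 - seq_len, 1))
    return [[slope * d for d in distances] for slope in alibi_slopes(n_heads, alibi_bias_max)]


if __name__ == "__main__":
    for head, row in enumerate(alibi_bias(n_heads=4, seq_len=4)):
        print(head, row)
```

Because the bias depends only on relative distance, the same function can produce a bias for a longer `seq_len` than was used in training, which is why `max_seq_len` can be raised at inference time.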
28
 
29
  This model uses the MosaicML LLM codebase, which can be found in the [llm-foundry repository](https://github.com/mosaicml/llm-foundry). It was trained by MosaicML’s NLP team on the [MosaicML platform](https://www.mosaicml.com/training) for LLM pretraining, finetuning, and inference.
30
 
 
49
  * License: Apache 2.0
50
 
51
  * [MPT-7B-Instruct](https://huggingface.co/mosaicml/mpt-7b-instruct): a model for short-form instruction following.
52
+ Built by finetuning MPT-7B on a [dataset](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) we also release, derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.
53
+ * License: Apache 2.0
 
54
 
55
  * [MPT-7B-Chat](https://huggingface.co/mosaicml/mpt-7b-chat): a chatbot-like model for dialogue generation.
56
  Built by finetuning MPT-7B on the [ShareGPT-Vicuna](https://huggingface.co/datasets/jeffwan/sharegpt_vicuna), [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3),
57
  [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), and [Evol-Instruct](https://huggingface.co/datasets/victor123/evol_instruct_70k) datasets.
58
  * License: _CC-By-NC-SA-4.0_
 
59
 
60
  ## Model Date
61
 
 
69
 
70
  * [Blog post: Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs](https://www.mosaicml.com/blog/mpt-7b)
71
  * [Codebase (mosaicml/llm-foundry repo)](https://github.com/mosaicml/llm-foundry/)
72
+ * Questions: Feel free to contact us via the [MosaicML Community Slack](https://mosaicml.me/slack)!
73
 
74
 
75
  ## How to Use
 
83
  trust_remote_code=True
84
  )
85
  ```
86
+ Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
87
  This is because we use a custom `MPT` model architecture that is not yet part of the Hugging Face `transformers` package.
88
  `MPT` includes options for many training efficiency features such as [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), [QK LayerNorm](https://arxiv.org/abs/2010.04245), and more.
89
 
90
+ To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention, you can load the model on GPU (`cuda:0`) with `attn_impl='triton'` and with `bfloat16` precision:
91
  ```python
92
+ import torch
93
+ import transformers
94
+
95
+ name = 'mosaicml/mpt-7b'
96
+
97
+ config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
98
  config.attn_config['attn_impl'] = 'triton'
99
+ config.init_device = 'cuda:0' # For fast initialization directly on GPU!
100
 
101
  model = transformers.AutoModelForCausalLM.from_pretrained(
102
+ name,
103
  config=config,
104
+ torch_dtype=torch.bfloat16, # Load model weights in bfloat16
105
  trust_remote_code=True
106
  )
 
107
  ```
108
 
109
  Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. For example:
110
 
111
  ```python
112
+ import transformers
113
+
114
+ name = 'mosaicml/mpt-7b'
115
+
116
+ config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
117
+ config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096
118
+
119
  model = transformers.AutoModelForCausalLM.from_pretrained(
120
+ name,
121
  config=config,
122
  trust_remote_code=True
123
  )
 
127
 
128
  ```python
129
  from transformers import AutoTokenizer
130
+ tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
131
+ ```
132
+
133
+ The model can then be used, for example, within a text-generation pipeline.
134
+ Note: when running Torch modules in lower precision, it is best practice to use the [torch.autocast context manager](https://pytorch.org/docs/stable/amp.html).
135
+
136
+ ```python
137
+ from transformers import pipeline
138
+
139
+ pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')
140
+
141
+ with torch.autocast('cuda', dtype=torch.bfloat16):
142
+ print(
143
+ pipe('Here is a recipe for vegan banana bread:\n',
144
+ max_new_tokens=100,
145
+ do_sample=True,
146
+ use_cache=True))
147
  ```
148
 
149
  ## Model Description
 
171
 
172
  ### Streaming Datasets
173
 
174
+ Data was formatted using the MosaicML [StreamingDataset](https://github.com/mosaicml/streaming) library to host our data in object storage and efficiently stream it to our compute cluster during training.
175
  StreamingDataset obviates the need to download the whole dataset before starting training, and allows instant resumption of training from any point in the dataset.
176
 
177
 
 
196
  Samples for each batch were selected from one of the datasets with the probability specified above.
197
  The examples were shuffled within each dataset, and each example was constructed from as many sequences from that dataset as were necessary to fill the 2048 sequence length.
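
The packing step described above can be sketched roughly as follows. This is an illustrative assumption, not the llm-foundry implementation: the function name `pack_sequences` is hypothetical, no separator tokens are inserted between documents, and any leftover tokens shorter than one full example are simply dropped.

```python
from typing import Iterable, Iterator, List


def pack_sequences(sequences: Iterable[List[int]], max_seq_len: int) -> Iterator[List[int]]:
    """Concatenate tokenized sequences and emit fixed-length examples.

    Tokens are appended to a running buffer; whenever the buffer holds at
    least max_seq_len tokens, one full example is emitted.
    """
    if max_seq_len <= 0:
        raise ValueError("max_seq_len must be positive")
    buffer: List[int] = []
    for tokens in sequences:
        buffer.extend(tokens)
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]
            buffer = buffer[max_seq_len:]
    # Assumption: a trailing partial example is discarded rather than padded.


if __name__ == "__main__":
    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(list(pack_sequences(docs, max_seq_len=4)))
```

With `max_seq_len=2048`, every emitted example would fill the full training sequence length, at the cost of examples that may span document boundaries.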
198
 
199
+ The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer. This BPE tokenizer has a number of desirable characteristics,
200
+ most of which are relevant for tokenizing code:
201
+ (1) It was trained on a diverse mix of data that includes code (The Pile)
202
+ (2) It applies consistent space delimitation, unlike the GPT2 tokenizer which tokenizes inconsistently depending on the presence of prefix spaces
203
+ (3) It contains tokens for repeated space characters, which allows superior compression of text with large amounts of repeated space characters.
204
 
205
  The model vocabulary size of 50432 was set to be a multiple of 128 (as in [MEGATRON-LM](https://arxiv.org/abs/1909.08053)), model flop utilization (MFU) increased by up to four percentage points.
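
As a small illustration of the vocabulary padding mentioned above, the sketch below rounds a size up to a multiple of 128 and checks that 50432 satisfies that constraint. The helper name `round_up_to_multiple` is hypothetical, and the README does not say how 50432 itself was chosen, only that it is a multiple of 128.

```python
def round_up_to_multiple(n: int, multiple: int = 128) -> int:
    """Return the smallest multiple of `multiple` that is >= n."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((n + multiple - 1) // multiple) * multiple


MPT_VOCAB_SIZE = 50432  # value stated in the model card

if __name__ == "__main__":
    # 50432 is already aligned, so rounding leaves it unchanged.
    print(MPT_VOCAB_SIZE % 128, round_up_to_multiple(MPT_VOCAB_SIZE))
```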
206
 
207
  ### Training Configuration
208
 
209
+ This model was trained on 440 A100-40GBs for about 9.5 days using the [MosaicML Platform](https://www.mosaicml.com/platform).
210
+ The model was trained with sharded data parallelism using [FSDP](https://pytorch.org/docs/stable/fsdp.html) and used the [LION](https://arxiv.org/abs/2302.06675) optimizer.
211
 
212
  ## Limitations and Biases
213
 
214
  _The following language is modified from [EleutherAI's GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b)_
215
 
216
+ MPT-7B (Base) is **not** intended for deployment without finetuning.
217
  It should not be used for human-facing interactions without further guardrails and user consent.
218
 
219
  MPT-7B can produce factually incorrect output, and should not be relied on to produce factually accurate information.
 
236
  ```
237
  @online{MosaicML2023Introducing,
238
  author = {MosaicML NLP Team},
239
+ title = {Introducing MPT-7B: A New Standard for Open-Source,
240
+ Commercially Usable LLMs},
241
  year = {2023},
242
  url = {www.mosaicml.com/blog/mpt-7b},
243
+ note = {Accessed: 2023-05-05},
244
+ urldate = {2023-05-05}
245
  }
246
  ```
adapt_tokenizer.py CHANGED
@@ -1,9 +1,8 @@
1
- from typing import Union
2
- from transformers import AutoTokenizer, PreTrainedTokenizer, PreTrainedTokenizerFast
3
- Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]
4
  NUM_SENTINEL_TOKENS: int = 100
5
 
6
- def adapt_tokenizer_for_denoising(tokenizer: Tokenizer):
7
  """Adds sentinel tokens and padding token (if missing).
8
 
9
  Expands the tokenizer vocabulary to include sentinel tokens
@@ -34,7 +33,7 @@ class AutoTokenizerForMOD(AutoTokenizer):
34
  """
35
 
36
  @classmethod
37
- def from_pretrained(cls, *args, **kwargs):
38
  """See `AutoTokenizer.from_pretrained` docstring."""
39
  tokenizer = super().from_pretrained(*args, **kwargs)
40
  adapt_tokenizer_for_denoising(tokenizer)
 
1
+ from typing import Any
2
+ from transformers import AutoTokenizer, PreTrainedTokenizerBase
 
3
  NUM_SENTINEL_TOKENS: int = 100
4
 
5
+ def adapt_tokenizer_for_denoising(tokenizer: PreTrainedTokenizerBase) -> None:
6
  """Adds sentinel tokens and padding token (if missing).
7
 
8
  Expands the tokenizer vocabulary to include sentinel tokens
 
33
  """
34
 
35
  @classmethod
36
+ def from_pretrained(cls, *args: Any, **kwargs: Any) -> PreTrainedTokenizerBase:
37
  """See `AutoTokenizer.from_pretrained` docstring."""
38
  tokenizer = super().from_pretrained(*args, **kwargs)
39
  adapt_tokenizer_for_denoising(tokenizer)
attention.py CHANGED
@@ -1,14 +1,30 @@
1
  """Attention layers."""
2
  import math
3
  import warnings
4
- from typing import Optional
5
  import torch
6
  import torch.nn as nn
7
  from einops import rearrange
 
8
  from torch import nn
9
- from .norm import LPLayerNorm
 
10
 
11
- def _reset_is_causal(num_query_tokens: int, num_key_tokens: int, original_is_causal: bool):
 
12
  if original_is_causal and num_query_tokens != num_key_tokens:
13
  if num_query_tokens != 1:
14
  raise NotImplementedError('MPT does not support query and key with different number of tokens, unless number of query tokens is 1.')
@@ -16,27 +32,57 @@ def _reset_is_causal(num_query_tokens: int, num_key_tokens: int, original_is_cau
16
  return False
17
  return original_is_causal
18
 
19
- def scaled_multihead_dot_product_attention(query, key, value, n_heads, softmax_scale=None, attn_bias=None, key_padding_mask=None, is_causal=False, dropout_p=0.0, training=False, needs_weights=False, multiquery=False):
20
  q = rearrange(query, 'b s (h d) -> b h s d', h=n_heads)
21
- k = rearrange(key, 'b s (h d) -> b h d s', h=1 if multiquery else n_heads)
22
- v = rearrange(value, 'b s (h d) -> b h s d', h=1 if multiquery else n_heads)
23
- min_val = torch.finfo(q.dtype).min
 
 
 
 
24
  (b, _, s_q, d) = q.shape
25
  s_k = k.size(-1)
 
 
 
26
  if softmax_scale is None:
27
  softmax_scale = 1 / math.sqrt(d)
28
  attn_weight = q.matmul(k) * softmax_scale
29
  if attn_bias is not None:
 
 
 
30
  if attn_bias.size(-1) != 1 and attn_bias.size(-1) != s_k or (attn_bias.size(-2) != 1 and attn_bias.size(-2) != s_q):
31
  raise RuntimeError(f'attn_bias (shape: {attn_bias.shape}) is expected to broadcast to shape: {attn_weight.shape}.')
32
  attn_weight = attn_weight + attn_bias
 
33
  if key_padding_mask is not None:
34
  if attn_bias is not None:
35
- warnings.warn('Propogating key_padding_mask to the attention module ' + 'and applying it within the attention module can cause ' + 'unneccessary computation/memory usage. Consider integrating ' + 'into attn_bias once and passing that to each attention ' + 'module instead.')
36
  attn_weight = attn_weight.masked_fill(~key_padding_mask.view((b, 1, 1, s_k)), min_val)
37
- if is_causal:
38
  s = max(s_q, s_k)
39
- causal_mask = attn_weight.new_ones(s, s, dtype=torch.float16)
40
  causal_mask = causal_mask.tril()
41
  causal_mask = causal_mask.to(torch.bool)
42
  causal_mask = ~causal_mask
@@ -45,25 +91,42 @@ def scaled_multihead_dot_product_attention(query, key, value, n_heads, softmax_s
45
  attn_weight = torch.softmax(attn_weight, dim=-1)
46
  if dropout_p:
47
  attn_weight = torch.nn.functional.dropout(attn_weight, p=dropout_p, training=training, inplace=True)
48
- out = attn_weight.matmul(v)
49
  out = rearrange(out, 'b h s d -> b s (h d)')
50
  if needs_weights:
51
- return (out, attn_weight)
52
- return (out, None)
53
 
54
- def check_valid_inputs(*tensors, valid_dtypes=[torch.float16, torch.bfloat16]):
 
 
55
  for tensor in tensors:
56
  if tensor.dtype not in valid_dtypes:
57
  raise TypeError(f'tensor.dtype={tensor.dtype!r} must be in valid_dtypes={valid_dtypes!r}.')
58
  if not tensor.is_cuda:
59
  raise TypeError(f'Inputs must be cuda tensors (tensor.is_cuda={tensor.is_cuda!r}).')
60
 
61
- def flash_attn_fn(query, key, value, n_heads, softmax_scale=None, attn_bias=None, key_padding_mask=None, is_causal=False, dropout_p=0.0, training=False, needs_weights=False, multiquery=False):
62
  try:
63
  from flash_attn import bert_padding, flash_attn_interface
64
  except:
65
- raise RuntimeError('Please install flash-attn==1.0.3.post0')
66
  check_valid_inputs(query, key, value)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
67
  if attn_bias is not None:
68
  raise NotImplementedError(f'attn_bias not implemented for flash attn.')
69
  (batch_size, seqlen) = query.shape[:2]
@@ -73,26 +136,58 @@ def flash_attn_fn(query, key, value, n_heads, softmax_scale=None, attn_bias=None
73
  (query_unpad, indices_q, cu_seqlens_q, max_seqlen_q) = bert_padding.unpad_input(query, query_padding_mask)
74
  query_unpad = rearrange(query_unpad, 'nnz (h d) -> nnz h d', h=n_heads)
75
  (key_unpad, _, cu_seqlens_k, max_seqlen_k) = bert_padding.unpad_input(key, key_padding_mask)
76
- key_unpad = rearrange(key_unpad, 'nnz (h d) -> nnz h d', h=1 if multiquery else n_heads)
77
  (value_unpad, _, _, _) = bert_padding.unpad_input(value, key_padding_mask)
78
- value_unpad = rearrange(value_unpad, 'nnz (h d) -> nnz h d', h=1 if multiquery else n_heads)
79
- if multiquery:
80
  key_unpad = key_unpad.expand(key_unpad.size(0), n_heads, key_unpad.size(-1))
81
  value_unpad = value_unpad.expand(value_unpad.size(0), n_heads, value_unpad.size(-1))
 
 
 
82
  dropout_p = dropout_p if training else 0.0
83
  reset_is_causal = _reset_is_causal(query.size(1), key.size(1), is_causal)
84
- output_unpad = flash_attn_interface.flash_attn_unpadded_func(query_unpad, key_unpad, value_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, dropout_p, softmax_scale=softmax_scale, causal=reset_is_causal, return_attn_probs=needs_weights)
 
 
 
 
 
85
  output = bert_padding.pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'), indices_q, batch_size, seqlen)
86
- return (output, None)
87
 
88
- def triton_flash_attn_fn(query, key, value, n_heads, softmax_scale=None, attn_bias=None, key_padding_mask=None, is_causal=False, dropout_p=0.0, training=False, needs_weights=False, multiquery=False):
89
  try:
90
- from flash_attn import flash_attn_triton
91
  except:
92
- raise RuntimeError('Please install flash-attn==1.0.3.post0 and triton==2.0.0.dev20221202')
 
 
 
 
 
 
 
 
93
  check_valid_inputs(query, key, value)
94
  if dropout_p:
95
  raise NotImplementedError(f'Dropout not implemented for attn_impl: triton.')
 
96
  if needs_weights:
97
  raise NotImplementedError(f'attn_impl: triton cannot return attn weights.')
98
  if key_padding_mask is not None:
@@ -102,136 +197,103 @@ def triton_flash_attn_fn(query, key, value, n_heads, softmax_scale=None, attn_bi
102
  attn_bias = query.new_zeros(b_size, 1, 1, s_k)
103
  attn_bias = attn_bias.masked_fill(~key_padding_mask.view((b_size, 1, 1, s_k)), torch.finfo(query.dtype).min)
104
  query = rearrange(query, 'b s (h d) -> b s h d', h=n_heads)
105
- key = rearrange(key, 'b s (h d) -> b s h d', h=1 if multiquery else n_heads)
106
- value = rearrange(value, 'b s (h d) -> b s h d', h=1 if multiquery else n_heads)
107
- if multiquery:
108
- key = key.expand(*key.shape[:2], n_heads, key.size(-1))
109
- value = value.expand(*value.shape[:2], n_heads, value.size(-1))
 
 
 
110
  reset_is_causal = _reset_is_causal(query.size(1), key.size(1), is_causal)
111
- attn_output = flash_attn_triton.flash_attn_func(query, key, value, attn_bias, reset_is_causal, softmax_scale)
112
  output = attn_output.view(*attn_output.shape[:2], -1)
113
- return (output, None)
114
 
115
- class MultiheadAttention(nn.Module):
116
- """Multi-head self attention.
117
 
118
- Using torch or triton attention implemetation enables user to also use
119
- additive bias.
 
 
 
120
  """
121
 
122
- def __init__(self, d_model: int, n_heads: int, attn_impl: str='triton', clip_qkv: Optional[float]=None, qk_ln: bool=False, softmax_scale: Optional[float]=None, attn_pdrop: float=0.0, low_precision_layernorm: bool=False, device: Optional[str]=None):
123
  super().__init__()
124
  self.attn_impl = attn_impl
125
  self.clip_qkv = clip_qkv
126
  self.qk_ln = qk_ln
127
  self.d_model = d_model
128
  self.n_heads = n_heads
 
 
 
 
 
 
 
 
129
  self.softmax_scale = softmax_scale
130
  if self.softmax_scale is None:
131
  self.softmax_scale = 1 / math.sqrt(self.d_model / self.n_heads)
132
  self.attn_dropout_p = attn_pdrop
133
- self.Wqkv = nn.Linear(self.d_model, 3 * self.d_model, device=device)
134
- fuse_splits = (d_model, 2 * d_model)
 
 
 
135
  self.Wqkv._fused = (0, fuse_splits)
136
  if self.qk_ln:
137
- layernorm_class = LPLayerNorm if low_precision_layernorm else nn.LayerNorm
138
- self.q_ln = layernorm_class(self.d_model, device=device)
139
- self.k_ln = layernorm_class(self.d_model, device=device)
140
  if self.attn_impl == 'flash':
141
  self.attn_fn = flash_attn_fn
142
  elif self.attn_impl == 'triton':
143
  self.attn_fn = triton_flash_attn_fn
144
- warnings.warn('While `attn_impl: triton` can be faster than `attn_impl: flash` ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using `attn_impl: flash` if your model does not use `alibi` or `prefix_lm`.')
145
  elif self.attn_impl == 'torch':
146
  self.attn_fn = scaled_multihead_dot_product_attention
147
- if torch.cuda.is_available():
148
- warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
149
  else:
150
  raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
151
- self.out_proj = nn.Linear(self.d_model, self.d_model, device=device)
152
  self.out_proj._is_residual = True
153
 
154
- def forward(self, x, past_key_value=None, attn_bias=None, attention_mask=None, is_causal=True, needs_weights=False):
155
  qkv = self.Wqkv(x)
156
  if self.clip_qkv:
157
- qkv.clamp_(min=-self.clip_qkv, max=self.clip_qkv)
158
- (query, key, value) = qkv.chunk(3, dim=2)
159
  key_padding_mask = attention_mask
160
  if self.qk_ln:
161
  dtype = query.dtype
162
  query = self.q_ln(query).to(dtype)
163
  key = self.k_ln(key).to(dtype)
164
- if past_key_value is not None:
165
- if len(past_key_value) != 0:
166
- key = torch.cat([past_key_value[0], key], dim=1)
167
- value = torch.cat([past_key_value[1], value], dim=1)
168
- past_key_value = (key, value)
169
- if attn_bias is not None:
170
- attn_bias = attn_bias[:, :, -query.size(1):, -key.size(1):]
171
- (context, attn_weights) = self.attn_fn(query, key, value, self.n_heads, softmax_scale=self.softmax_scale, attn_bias=attn_bias, key_padding_mask=key_padding_mask, is_causal=is_causal, dropout_p=self.attn_dropout_p, training=self.training, needs_weights=needs_weights)
172
  return (self.out_proj(context), attn_weights, past_key_value)
173
 
174
- class MultiQueryAttention(nn.Module):
175
- """Multi-Query self attention.
176
 
177
- Using torch or triton attention implemetation enables user to also use
178
  additive bias.
179
  """
180
 
181
- def __init__(self, d_model: int, n_heads: int, attn_impl: str='triton', clip_qkv: Optional[float]=None, qk_ln: bool=False, softmax_scale: Optional[float]=None, attn_pdrop: float=0.0, low_precision_layernorm: bool=False, device: Optional[str]=None):
182
- super().__init__()
183
- self.attn_impl = attn_impl
184
- self.clip_qkv = clip_qkv
185
- self.qk_ln = qk_ln
186
- self.d_model = d_model
187
- self.n_heads = n_heads
188
- self.head_dim = d_model // n_heads
189
- self.softmax_scale = softmax_scale
190
- if self.softmax_scale is None:
191
- self.softmax_scale = 1 / math.sqrt(self.head_dim)
192
- self.attn_dropout_p = attn_pdrop
193
- self.Wqkv = nn.Linear(d_model, d_model + 2 * self.head_dim, device=device)
194
- fuse_splits = (d_model, d_model + self.head_dim)
195
- self.Wqkv._fused = (0, fuse_splits)
196
- if self.qk_ln:
197
- layernorm_class = LPLayerNorm if low_precision_layernorm else nn.LayerNorm
198
- self.q_ln = layernorm_class(d_model, device=device)
199
- self.k_ln = layernorm_class(self.head_dim, device=device)
200
- if self.attn_impl == 'flash':
201
- self.attn_fn = flash_attn_fn
202
- elif self.attn_impl == 'triton':
203
- self.attn_fn = triton_flash_attn_fn
204
- warnings.warn('While `attn_impl: triton` can be faster than `attn_impl: flash` ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using `attn_impl: flash` if your model does not use `alibi` or `prefix_lm`.')
205
- elif self.attn_impl == 'torch':
206
- self.attn_fn = scaled_multihead_dot_product_attention
207
- if torch.cuda.is_available():
208
- warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
209
- else:
210
- raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
211
- self.out_proj = nn.Linear(self.d_model, self.d_model, device=device)
212
- self.out_proj._is_residual = True
213
 
214
- def forward(self, x, past_key_value=None, attn_bias=None, attention_mask=None, is_causal=True, needs_weights=False):
215
- qkv = self.Wqkv(x)
216
- if self.clip_qkv:
217
- qkv.clamp_(min=-self.clip_qkv, max=self.clip_qkv)
218
- (query, key, value) = qkv.split([self.d_model, self.head_dim, self.head_dim], dim=2)
219
- key_padding_mask = attention_mask
220
- if self.qk_ln:
221
- dtype = query.dtype
222
- query = self.q_ln(query).to(dtype)
223
- key = self.k_ln(key).to(dtype)
224
- if past_key_value is not None:
225
- if len(past_key_value) != 0:
226
- key = torch.cat([past_key_value[0], key], dim=1)
227
- value = torch.cat([past_key_value[1], value], dim=1)
228
- past_key_value = (key, value)
229
- if attn_bias is not None:
230
- attn_bias = attn_bias[:, :, -query.size(1):, -key.size(1):]
231
- (context, attn_weights) = self.attn_fn(query, key, value, self.n_heads, softmax_scale=self.softmax_scale, attn_bias=attn_bias, key_padding_mask=key_padding_mask, is_causal=is_causal, dropout_p=self.attn_dropout_p, training=self.training, needs_weights=needs_weights, multiquery=True)
232
- return (self.out_proj(context), attn_weights, past_key_value)
233
 
234
- def attn_bias_shape(attn_impl, n_heads, seq_len, alibi, prefix_lm, causal, use_sequence_id):
235
  if attn_impl == 'flash':
236
  return None
237
  elif attn_impl in ['torch', 'triton']:
@@ -245,7 +307,7 @@ def attn_bias_shape(attn_impl, n_heads, seq_len, alibi, prefix_lm, causal, use_s
245
  else:
246
  raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
247
 
248
- def build_attn_bias(attn_impl, attn_bias, n_heads, seq_len, causal=False, alibi=False, alibi_bias_max=8):
249
  if attn_impl == 'flash':
250
  return None
251
  elif attn_impl in ['torch', 'triton']:
@@ -256,7 +318,7 @@ def build_attn_bias(attn_impl, attn_bias, n_heads, seq_len, causal=False, alibi=
256
  else:
257
  raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
258
 
259
- def gen_slopes(n_heads, alibi_bias_max=8, device=None):
260
  _n_heads = 2 ** math.ceil(math.log2(n_heads))
261
  m = torch.arange(1, _n_heads + 1, dtype=torch.float32, device=device)
262
  m = m.mul(alibi_bias_max / _n_heads)
@@ -265,7 +327,7 @@ def gen_slopes(n_heads, alibi_bias_max=8, device=None):
265
  slopes = torch.concat([slopes[1::2], slopes[::2]])[:n_heads]
266
  return slopes.view(1, n_heads, 1, 1)
267
 
268
- def build_alibi_bias(n_heads, seq_len, full=False, alibi_bias_max=8, device=None, dtype=None):
269
  alibi_bias = torch.arange(1 - seq_len, 1, dtype=torch.int32, device=device).view(1, 1, 1, seq_len)
270
  if full:
271
  alibi_bias = alibi_bias - torch.arange(1 - seq_len, 1, dtype=torch.int32, device=device).view(1, 1, seq_len, 1)
@@ -273,4 +335,4 @@ def build_alibi_bias(n_heads, seq_len, full=False, alibi_bias_max=8, device=None
273
  slopes = gen_slopes(n_heads, alibi_bias_max, device=device)
274
  alibi_bias = alibi_bias * slopes
275
  return alibi_bias.to(dtype=dtype)
276
- ATTN_CLASS_REGISTRY = {'multihead_attention': MultiheadAttention, 'multiquery_attention': MultiQueryAttention}
 
1
  """Attention layers."""
2
  import math
3
  import warnings
4
+ from typing import Any, List, Optional, Tuple
5
  import torch
6
  import torch.nn as nn
7
  from einops import rearrange
8
+ from packaging import version
9
  from torch import nn
10
+ from .fc import FC_CLASS_REGISTRY
11
+ from .norm import NORM_CLASS_REGISTRY
12
 
13
+ def is_flash_v2_installed():
14
+ try:
15
+ import flash_attn as flash_attn
16
+ except:
17
+ return False
18
+ return version.parse(flash_attn.__version__) >= version.parse('2.0.0')
19
+
20
+ def is_flash_v1_installed():
21
+ try:
22
+ import flash_attn as flash_attn
23
+ except:
24
+ return False
25
+ return version.parse(flash_attn.__version__) < version.parse('2.0.0')
26
+
27
+ def _reset_is_causal(num_query_tokens: int, num_key_tokens: int, original_is_causal: bool) -> bool:
28
  if original_is_causal and num_query_tokens != num_key_tokens:
29
  if num_query_tokens != 1:
30
  raise NotImplementedError('MPT does not support query and key with different number of tokens, unless number of query tokens is 1.')
 
32
  return False
33
  return original_is_causal
34
 
35
+ def repeat_kv_for_gqa(hidden: torch.Tensor, n_rep: int) -> torch.Tensor:
36
+ """Perform repeat of kv heads along a particular dimension.
37
+
38
+ hidden.shape expected to be: (batch size, seq len, kv_n_heads, head_dim)
39
+ n_rep: amount of repetitions of kv_n_heads
40
+ Unlike torch.repeat_interleave, this function avoids allocating new memory.
41
+ """
42
+ if n_rep == 1:
43
+ return hidden
44
+ (b, s, kv_n_heads, d) = hidden.shape
45
+ hidden = hidden[:, :, :, None, :].expand(b, s, kv_n_heads, n_rep, d)
46
+ return hidden.reshape(b, s, kv_n_heads * n_rep, d)
47
+
48
+ def scaled_multihead_dot_product_attention(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, n_heads: int, kv_n_heads: Optional[int]=None, past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]]=None, softmax_scale: Optional[float]=None, attn_bias: Optional[torch.Tensor]=None, key_padding_mask: Optional[torch.Tensor]=None, is_causal: bool=False, dropout_p: float=0.0, training: bool=False, needs_weights: bool=False, multiquery: bool=False) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor, torch.Tensor]]]:
49
+ if multiquery:
50
+ warnings.warn(DeprecationWarning('The direct use of the multiquery arg is deprecated. Setting kv_n_heads=1 automatically. Please set kv_n_heads=1 explicitly to remove this warning.'))
51
+ kv_n_heads = 1
52
+ elif kv_n_heads is None:
53
+ warnings.warn(DeprecationWarning('Not specifying a value for the kv_n_heads arg is deprecated. Setting kv_n_heads=n_heads automatically. Please set kv_n_heads=n_heads explicitly to remove this warning.'))
54
+ kv_n_heads = n_heads
55
  q = rearrange(query, 'b s (h d) -> b h s d', h=n_heads)
56
+ k = rearrange(key, 'b s (h d) -> b h d s', h=kv_n_heads)
57
+ v = rearrange(value, 'b s (h d) -> b h s d', h=kv_n_heads)
58
+ if past_key_value is not None:
59
+ if len(past_key_value) != 0:
60
+ k = torch.cat([past_key_value[0], k], dim=3)
61
+ v = torch.cat([past_key_value[1], v], dim=2)
62
+ past_key_value = (k, v)
63
  (b, _, s_q, d) = q.shape
64
  s_k = k.size(-1)
65
+ if kv_n_heads > 1 and kv_n_heads < n_heads:
66
+ k = repeat_kv_for_gqa(k.transpose(1, 2), n_heads // kv_n_heads).transpose(1, 2)
67
+ v = repeat_kv_for_gqa(v.transpose(1, 2), n_heads // kv_n_heads).transpose(1, 2)
68
  if softmax_scale is None:
69
  softmax_scale = 1 / math.sqrt(d)
70
  attn_weight = q.matmul(k) * softmax_scale
71
  if attn_bias is not None:
72
+ _s_q = max(0, attn_bias.size(2) - s_q)
73
+ _s_k = max(0, attn_bias.size(3) - s_k)
74
+ attn_bias = attn_bias[:, :, _s_q:, _s_k:]
75
  if attn_bias.size(-1) != 1 and attn_bias.size(-1) != s_k or (attn_bias.size(-2) != 1 and attn_bias.size(-2) != s_q):
76
  raise RuntimeError(f'attn_bias (shape: {attn_bias.shape}) is expected to broadcast to shape: {attn_weight.shape}.')
77
  attn_weight = attn_weight + attn_bias
78
+ min_val = torch.finfo(q.dtype).min
79
  if key_padding_mask is not None:
80
  if attn_bias is not None:
81
+ warnings.warn('Propagating key_padding_mask to the attention module ' + 'and applying it within the attention module can cause ' + 'unnecessary computation/memory usage. Consider integrating ' + 'into attn_bias once and passing that to each attention ' + 'module instead.')
82
  attn_weight = attn_weight.masked_fill(~key_padding_mask.view((b, 1, 1, s_k)), min_val)
83
+ if is_causal and (not q.size(2) == 1):
84
  s = max(s_q, s_k)
85
+ causal_mask = attn_weight.new_ones(s, s, dtype=torch.float32)
86
  causal_mask = causal_mask.tril()
87
  causal_mask = causal_mask.to(torch.bool)
88
  causal_mask = ~causal_mask
 
91
  attn_weight = torch.softmax(attn_weight, dim=-1)
92
  if dropout_p:
93
  attn_weight = torch.nn.functional.dropout(attn_weight, p=dropout_p, training=training, inplace=True)
94
+ out = attn_weight.to(v.dtype).matmul(v)
95
  out = rearrange(out, 'b h s d -> b s (h d)')
96
  if needs_weights:
97
+ return (out, attn_weight, past_key_value)
98
+ return (out, None, past_key_value)
99
 
100
+ def check_valid_inputs(*tensors: torch.Tensor, valid_dtypes: Optional[List[torch.dtype]]=None):
101
+ if valid_dtypes is None:
102
+ valid_dtypes = [torch.float16, torch.bfloat16]
103
  for tensor in tensors:
104
  if tensor.dtype not in valid_dtypes:
105
  raise TypeError(f'tensor.dtype={tensor.dtype!r} must be in valid_dtypes={valid_dtypes!r}.')
106
  if not tensor.is_cuda:
107
  raise TypeError(f'Inputs must be cuda tensors (tensor.is_cuda={tensor.is_cuda!r}).')
108
 
109
+ def flash_attn_fn(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, n_heads: int, kv_n_heads: Optional[int]=None, past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]]=None, softmax_scale: Optional[float]=None, attn_bias: Optional[torch.Tensor]=None, key_padding_mask: Optional[torch.Tensor]=None, is_causal: bool=False, dropout_p: float=0.0, training: bool=False, needs_weights: bool=False, multiquery: bool=False) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor, torch.Tensor]]]:
110
  try:
111
  from flash_attn import bert_padding, flash_attn_interface
112
  except:
113
+ raise RuntimeError('Please install flash-attn==1.0.9 or flash-attn==2.3.2')
114
  check_valid_inputs(query, key, value)
115
+ if multiquery:
116
+ warnings.warn(DeprecationWarning('The direct use of the multiquery arg is deprecated. Setting kv_n_heads=1 automatically. Please set kv_n_heads=1 explicitly to remove this warning.'))
117
+ kv_n_heads = 1
118
+ elif kv_n_heads is None:
119
+ warnings.warn(DeprecationWarning('Not specifying a value for the kv_n_heads arg is deprecated. Setting kv_n_heads=n_heads automatically. Please set kv_n_heads=n_heads explicitly to remove this warning.'))
120
+ kv_n_heads = n_heads
121
+ if past_key_value is not None:
122
+ if len(past_key_value) != 0:
123
+ key = torch.cat([past_key_value[0], key], dim=1)
124
+ value = torch.cat([past_key_value[1], value], dim=1)
125
+ past_key_value = (key, value)
126
+ if attn_bias is not None:
127
+ _s_q = max(0, attn_bias.size(2) - query.size(1))
128
+ _s_k = max(0, attn_bias.size(3) - key.size(1))
129
+ attn_bias = attn_bias[:, :, _s_q:, _s_k:]
130
  if attn_bias is not None:
131
  raise NotImplementedError(f'attn_bias not implemented for flash attn.')
132
  (batch_size, seqlen) = query.shape[:2]
 
136
  (query_unpad, indices_q, cu_seqlens_q, max_seqlen_q) = bert_padding.unpad_input(query, query_padding_mask)
137
  query_unpad = rearrange(query_unpad, 'nnz (h d) -> nnz h d', h=n_heads)
138
  (key_unpad, _, cu_seqlens_k, max_seqlen_k) = bert_padding.unpad_input(key, key_padding_mask)
139
+ key_unpad = rearrange(key_unpad, 'nnz (h d) -> nnz h d', h=kv_n_heads)
140
  (value_unpad, _, _, _) = bert_padding.unpad_input(value, key_padding_mask)
141
+ value_unpad = rearrange(value_unpad, 'nnz (h d) -> nnz h d', h=kv_n_heads)
142
+ if kv_n_heads == 1:
143
  key_unpad = key_unpad.expand(key_unpad.size(0), n_heads, key_unpad.size(-1))
144
  value_unpad = value_unpad.expand(value_unpad.size(0), n_heads, value_unpad.size(-1))
145
+ elif kv_n_heads < n_heads:
146
+ key_unpad = repeat_kv_for_gqa(key_unpad.view(batch_size, seqlen, kv_n_heads, -1), n_heads // kv_n_heads).view(batch_size * seqlen, n_heads, -1)
147
+ value_unpad = repeat_kv_for_gqa(value_unpad.view(batch_size, seqlen, kv_n_heads, -1), n_heads // kv_n_heads).view(batch_size * seqlen, n_heads, -1)
148
  dropout_p = dropout_p if training else 0.0
149
  reset_is_causal = _reset_is_causal(query.size(1), key.size(1), is_causal)
150
+ if is_flash_v1_installed():
151
+ output_unpad = flash_attn_interface.flash_attn_unpadded_func(q=query_unpad, k=key_unpad, v=value_unpad, cu_seqlens_q=cu_seqlens_q, cu_seqlens_k=cu_seqlens_k, max_seqlen_q=max_seqlen_q, max_seqlen_k=max_seqlen_k, dropout_p=dropout_p, softmax_scale=softmax_scale, causal=reset_is_causal, return_attn_probs=needs_weights)
152
+ elif is_flash_v2_installed():
153
+ output_unpad = flash_attn_interface.flash_attn_varlen_func(q=query_unpad, k=key_unpad, v=value_unpad, cu_seqlens_q=cu_seqlens_q, cu_seqlens_k=cu_seqlens_k, max_seqlen_q=max_seqlen_q, max_seqlen_k=max_seqlen_k, dropout_p=dropout_p, softmax_scale=softmax_scale, causal=reset_is_causal, return_attn_probs=needs_weights)
154
+ else:
155
+ raise RuntimeError('flash-attn==1.0.9 or flash-attn==2.3.2 is required.')
156
  output = bert_padding.pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'), indices_q, batch_size, seqlen)
157
+ return (output, None, past_key_value)
158
 
159
+ def triton_flash_attn_fn(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, n_heads: int, kv_n_heads: Optional[int]=None, past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]]=None, softmax_scale: Optional[float]=None, attn_bias: Optional[torch.Tensor]=None, key_padding_mask: Optional[torch.Tensor]=None, is_causal: bool=False, dropout_p: float=0.0, training: bool=False, needs_weights: bool=False, multiquery: bool=False) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor, torch.Tensor]]]:
160
  try:
161
+ from .flash_attn_triton import flash_attn_func
162
  except:
163
+ _installed = False
164
+ if version.parse(torch.__version__) < version.parse('2.0.0'):
165
+ _installed = True
166
+ try:
167
+ from flash_attn.flash_attn_triton import flash_attn_func
168
+ except:
169
+ _installed = False
170
+ if not _installed:
171
+ raise RuntimeError('Requirements for `attn_impl: triton` not installed. Either (1) have a CUDA-compatible GPU ' + 'and `pip install .[gpu]` if installing from llm-foundry source or ' + '`pip install triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python` ' + 'if installing from pypi, or (2) use torch attn model.attn_config.attn_impl=torch (torch attn_impl will be slow). ' + 'Note: (1) requires you have CMake and PyTorch already installed.')
172
  check_valid_inputs(query, key, value)
173
+ if multiquery:
174
+ warnings.warn(DeprecationWarning('The direct use of the multiquery arg is deprecated. Setting kv_n_heads=1 automatically. Please set kv_n_heads=1 explicitly to remove this warning.'))
175
+ kv_n_heads = 1
176
+ elif kv_n_heads is None:
177
+ warnings.warn(DeprecationWarning('Not specifying a value for the kv_n_heads arg is deprecated. Setting kv_n_heads=n_heads automatically. Please set kv_n_heads=n_heads explicitly to remove this warning.'))
178
+ kv_n_heads = n_heads
179
+ if past_key_value is not None:
180
+ if len(past_key_value) != 0:
181
+ key = torch.cat([past_key_value[0], key], dim=1)
182
+ value = torch.cat([past_key_value[1], value], dim=1)
183
+ past_key_value = (key, value)
184
+ if attn_bias is not None:
185
+ _s_q = max(0, attn_bias.size(2) - query.size(1))
186
+ _s_k = max(0, attn_bias.size(3) - key.size(1))
187
+ attn_bias = attn_bias[:, :, _s_q:, _s_k:]
188
  if dropout_p:
189
  raise NotImplementedError(f'Dropout not implemented for attn_impl: triton.')
190
+ dropout_p = dropout_p if training else 0.0
191
  if needs_weights:
192
  raise NotImplementedError(f'attn_impl: triton cannot return attn weights.')
193
  if key_padding_mask is not None:
 
197
  attn_bias = query.new_zeros(b_size, 1, 1, s_k)
198
  attn_bias = attn_bias.masked_fill(~key_padding_mask.view((b_size, 1, 1, s_k)), torch.finfo(query.dtype).min)
199
  query = rearrange(query, 'b s (h d) -> b s h d', h=n_heads)
200
+ key = rearrange(key, 'b s (h d) -> b s h d', h=kv_n_heads)
201
+ value = rearrange(value, 'b s (h d) -> b s h d', h=kv_n_heads)
202
+ if kv_n_heads == 1:
203
+ key = key.repeat(1, 1, n_heads, 1)
204
+ value = value.repeat(1, 1, n_heads, 1)
205
+ elif kv_n_heads < n_heads:
206
+ key = repeat_kv_for_gqa(key, n_heads // kv_n_heads)
207
+ value = repeat_kv_for_gqa(value, n_heads // kv_n_heads)
208
  reset_is_causal = _reset_is_causal(query.size(1), key.size(1), is_causal)
209
+ attn_output = flash_attn_func(query, key, value, attn_bias, reset_is_causal, softmax_scale)
210
  output = attn_output.view(*attn_output.shape[:2], -1)
211
+ return (output, None, past_key_value)
212
 
213
+ class GroupedQueryAttention(nn.Module):
214
+ """Grouped Query Attention (GQA) generalizes MHA and MQA.
215
 
216
+ MHA is Multi-head attention and MQA is Multi-query attention.
217
+
218
+ This allows the user to set a variable number of kv_n_heads, rather than
219
+ just n_heads or 1, as in MHA and MQA. Using torch or triton attention
220
+ implementation enables user to also use additive bias.
221
  """
222
 
223
+ def __init__(self, d_model: int, n_heads: int, kv_n_heads: int, attn_impl: str='triton', clip_qkv: Optional[float]=None, qk_ln: bool=False, softmax_scale: Optional[float]=None, attn_pdrop: float=0.0, norm_type: str='low_precision_layernorm', fc_type: str='torch', device: Optional[str]=None, bias: bool=True):
224
  super().__init__()
225
  self.attn_impl = attn_impl
226
  self.clip_qkv = clip_qkv
227
  self.qk_ln = qk_ln
228
  self.d_model = d_model
229
  self.n_heads = n_heads
230
+ self.kv_n_heads = kv_n_heads
231
+ self.head_dim = d_model // n_heads
232
+ if self.kv_n_heads <= 0:
233
+ raise ValueError('kv_n_heads should be greater than zero.')
234
+ if self.kv_n_heads > self.n_heads:
235
+ raise ValueError('The number of KV heads should be less than or equal to Q heads.')
236
+ if self.n_heads % self.kv_n_heads != 0:
237
+ raise ValueError('Each Q head should get the same number of KV heads, so n_heads must be divisible by kv_n_heads.')
238
  self.softmax_scale = softmax_scale
239
  if self.softmax_scale is None:
240
  self.softmax_scale = 1 / math.sqrt(self.d_model / self.n_heads)
241
  self.attn_dropout_p = attn_pdrop
242
+ fc_kwargs: dict[str, Any] = {'bias': bias}
243
+ if fc_type != 'te':
244
+ fc_kwargs['device'] = device
245
+ self.Wqkv = FC_CLASS_REGISTRY[fc_type](self.d_model, self.d_model + 2 * self.kv_n_heads * self.head_dim, **fc_kwargs)
246
+ fuse_splits = [i * self.head_dim for i in range(1, self.n_heads + 2 * self.kv_n_heads)]
247
  self.Wqkv._fused = (0, fuse_splits)
248
  if self.qk_ln:
249
+ norm_class = NORM_CLASS_REGISTRY[norm_type.lower()]
250
+ self.q_ln = norm_class(self.d_model, device=device)
251
+ self.k_ln = norm_class(self.kv_n_heads * self.head_dim, device=device)
252
  if self.attn_impl == 'flash':
253
  self.attn_fn = flash_attn_fn
254
  elif self.attn_impl == 'triton':
255
  self.attn_fn = triton_flash_attn_fn
 
256
  elif self.attn_impl == 'torch':
257
  self.attn_fn = scaled_multihead_dot_product_attention
 
 
258
  else:
259
  raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
260
+ self.out_proj = FC_CLASS_REGISTRY[fc_type](self.d_model, self.d_model, **fc_kwargs)
261
  self.out_proj._is_residual = True
262
 
263
+ def forward(self, x: torch.Tensor, past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]]=None, attn_bias: Optional[torch.Tensor]=None, attention_mask: Optional[torch.Tensor]=None, is_causal: bool=True, needs_weights: bool=False) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor, torch.Tensor]]]:
264
  qkv = self.Wqkv(x)
265
  if self.clip_qkv:
266
+ qkv = qkv.clamp(min=-self.clip_qkv, max=self.clip_qkv)
267
+ (query, key, value) = qkv.split([self.d_model, self.kv_n_heads * self.head_dim, self.kv_n_heads * self.head_dim], dim=2)
268
  key_padding_mask = attention_mask
269
  if self.qk_ln:
270
  dtype = query.dtype
271
  query = self.q_ln(query).to(dtype)
272
  key = self.k_ln(key).to(dtype)
273
+ (context, attn_weights, past_key_value) = self.attn_fn(query, key, value, self.n_heads, self.kv_n_heads, past_key_value=past_key_value, softmax_scale=self.softmax_scale, attn_bias=attn_bias, key_padding_mask=key_padding_mask, is_causal=is_causal, dropout_p=self.attn_dropout_p, training=self.training, needs_weights=needs_weights)
274
  return (self.out_proj(context), attn_weights, past_key_value)
275
 
276
+ class MultiheadAttention(GroupedQueryAttention):
277
+ """Multi-head self attention.
278
 
279
+ Using torch or triton attention implementation enables user to also use
280
  additive bias.
281
  """
282
 
283
+ def __init__(self, d_model: int, n_heads: int, attn_impl: str='triton', clip_qkv: Optional[float]=None, qk_ln: bool=False, softmax_scale: Optional[float]=None, attn_pdrop: float=0.0, norm_type: str='low_precision_layernorm', fc_type: str='torch', device: Optional[str]=None, bias: bool=True):
284
+ super().__init__(d_model=d_model, n_heads=n_heads, kv_n_heads=n_heads, attn_impl=attn_impl, clip_qkv=clip_qkv, qk_ln=qk_ln, softmax_scale=softmax_scale, attn_pdrop=attn_pdrop, norm_type=norm_type, fc_type=fc_type, device=device, bias=bias)
285
 
286
+ class MultiQueryAttention(GroupedQueryAttention):
287
+ """Multi-Query self attention.
288
+
289
+ Using torch or triton attention implementation enables user to also use
290
+ additive bias.
291
+ """
292
+
293
+ def __init__(self, d_model: int, n_heads: int, attn_impl: str='triton', clip_qkv: Optional[float]=None, qk_ln: bool=False, softmax_scale: Optional[float]=None, attn_pdrop: float=0.0, norm_type: str='low_precision_layernorm', fc_type: str='torch', device: Optional[str]=None, bias: bool=True):
294
+ super().__init__(d_model=d_model, n_heads=n_heads, kv_n_heads=1, attn_impl=attn_impl, clip_qkv=clip_qkv, qk_ln=qk_ln, softmax_scale=softmax_scale, attn_pdrop=attn_pdrop, norm_type=norm_type, fc_type=fc_type, device=device, bias=bias)
295
 
296
+ def attn_bias_shape(attn_impl: str, n_heads: int, seq_len: int, alibi: bool, prefix_lm: bool, causal: bool, use_sequence_id: bool) -> Optional[Tuple[int, int, int, int]]:
297
  if attn_impl == 'flash':
298
  return None
299
  elif attn_impl in ['torch', 'triton']:
 
307
  else:
308
  raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
309
 
310
+ def build_attn_bias(attn_impl: str, attn_bias: torch.Tensor, n_heads: int, seq_len: int, causal: bool=False, alibi: bool=False, alibi_bias_max: int=8) -> Optional[torch.Tensor]:
311
  if attn_impl == 'flash':
312
  return None
313
  elif attn_impl in ['torch', 'triton']:
 
318
  else:
319
  raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
320
 
321
+ def gen_slopes(n_heads: int, alibi_bias_max: int=8, device: Optional[torch.device]=None) -> torch.Tensor:
322
  _n_heads = 2 ** math.ceil(math.log2(n_heads))
323
  m = torch.arange(1, _n_heads + 1, dtype=torch.float32, device=device)
324
  m = m.mul(alibi_bias_max / _n_heads)
 
327
  slopes = torch.concat([slopes[1::2], slopes[::2]])[:n_heads]
328
  return slopes.view(1, n_heads, 1, 1)
329
 
330
+ def build_alibi_bias(n_heads: int, seq_len: int, full: bool=False, alibi_bias_max: int=8, device: Optional[torch.device]=None, dtype: Optional[torch.dtype]=None) -> torch.Tensor:
331
  alibi_bias = torch.arange(1 - seq_len, 1, dtype=torch.int32, device=device).view(1, 1, 1, seq_len)
332
  if full:
333
  alibi_bias = alibi_bias - torch.arange(1 - seq_len, 1, dtype=torch.int32, device=device).view(1, 1, seq_len, 1)
 
335
  slopes = gen_slopes(n_heads, alibi_bias_max, device=device)
336
  alibi_bias = alibi_bias * slopes
337
  return alibi_bias.to(dtype=dtype)
338
+ ATTN_CLASS_REGISTRY = {'multihead_attention': MultiheadAttention, 'multiquery_attention': MultiQueryAttention, 'grouped_query_attention': GroupedQueryAttention}
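
The grouped-query changes in attention.py above come down to two pieces of shape arithmetic: how wide the key and value slices of the fused `Wqkv` output are, and which KV head each query head ends up reading once `repeat_kv_for_gqa` has expanded the KV heads. The following is a minimal pure-Python sketch of that arithmetic, not the module's own code. The helper names `qkv_split_sizes` and `kv_head_for_query_head` are hypothetical and introduced only for illustration; the validation mirrors the checks in `GroupedQueryAttention.__init__`.

```python
from typing import List


def qkv_split_sizes(d_model: int, n_heads: int, kv_n_heads: int) -> List[int]:
    """Widths of the query, key and value slices of the fused Wqkv output.

    Hypothetical helper: mirrors the split sizes used in
    GroupedQueryAttention.forward, without touching any tensors.
    """
    if kv_n_heads <= 0:
        raise ValueError('kv_n_heads should be greater than zero.')
    if kv_n_heads > n_heads:
        raise ValueError('kv_n_heads must not exceed n_heads.')
    if n_heads % kv_n_heads != 0:
        raise ValueError('n_heads must be divisible by kv_n_heads.')
    head_dim = d_model // n_heads
    kv_width = kv_n_heads * head_dim
    return [d_model, kv_width, kv_width]


def kv_head_for_query_head(n_heads: int, kv_n_heads: int) -> List[int]:
    """Index of the KV head that each query head reads after repetition.

    Hypothetical helper: expanding to (kv_n_heads, n_rep) and flattening
    places copy r of KV head k at position k * n_rep + r, so query head q
    reads KV head q // n_rep.
    """
    n_rep = n_heads // kv_n_heads
    return [q_head // n_rep for q_head in range(n_heads)]


print(qkv_split_sizes(4096, 32, 32))  # MHA: [4096, 4096, 4096]
print(qkv_split_sizes(4096, 32, 8))   # GQA: [4096, 1024, 1024]
print(qkv_split_sizes(4096, 32, 1))   # MQA: [4096, 128, 128]
print(kv_head_for_query_head(8, 2))   # [0, 0, 0, 0, 1, 1, 1, 1]
```

With `kv_n_heads == n_heads` the split reduces to the familiar three equal slices of multi-head attention, and with `kv_n_heads == 1` it reduces to multi-query attention, which is why `MultiheadAttention` and `MultiQueryAttention` can now be thin subclasses.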
blocks.py CHANGED
@@ -1,41 +1,41 @@
1
  """GPT Blocks used for the GPT Model."""
2
- from typing import Dict, Optional, Tuple
3
  import torch
4
  import torch.nn as nn
5
  from .attention import ATTN_CLASS_REGISTRY
 
6
  from .norm import NORM_CLASS_REGISTRY
7
 
8
- class MPTMLP(nn.Module):
9
-
10
- def __init__(self, d_model: int, expansion_ratio: int, device: Optional[str]=None):
11
- super().__init__()
12
- self.up_proj = nn.Linear(d_model, expansion_ratio * d_model, device=device)
13
- self.act = nn.GELU(approximate='none')
14
- self.down_proj = nn.Linear(expansion_ratio * d_model, d_model, device=device)
15
- self.down_proj._is_residual = True
16
-
17
- def forward(self, x):
18
- return self.down_proj(self.act(self.up_proj(x)))
19
-
20
  class MPTBlock(nn.Module):
21
 
22
- def __init__(self, d_model: int, n_heads: int, expansion_ratio: int, attn_config: Dict={'attn_type': 'multihead_attention', 'attn_pdrop': 0.0, 'attn_impl': 'triton', 'qk_ln': False, 'clip_qkv': None, 'softmax_scale': None, 'prefix_lm': False, 'attn_uses_sequence_id': False, 'alibi': False, 'alibi_bias_max': 8}, resid_pdrop: float=0.0, norm_type: str='low_precision_layernorm', device: Optional[str]=None, **kwargs):
 
 
 
 
23
  del kwargs
24
  super().__init__()
25
  norm_class = NORM_CLASS_REGISTRY[norm_type.lower()]
 
26
  attn_class = ATTN_CLASS_REGISTRY[attn_config['attn_type']]
 
 
27
  self.norm_1 = norm_class(d_model, device=device)
28
- self.attn = attn_class(attn_impl=attn_config['attn_impl'], clip_qkv=attn_config['clip_qkv'], qk_ln=attn_config['qk_ln'], softmax_scale=attn_config['softmax_scale'], attn_pdrop=attn_config['attn_pdrop'], d_model=d_model, n_heads=n_heads, device=device)
29
- self.norm_2 = norm_class(d_model, device=device)
30
- self.ffn = MPTMLP(d_model=d_model, expansion_ratio=expansion_ratio, device=device)
 
 
31
  self.resid_attn_dropout = nn.Dropout(resid_pdrop)
32
  self.resid_ffn_dropout = nn.Dropout(resid_pdrop)
33
 
34
- def forward(self, x: torch.Tensor, past_key_value: Optional[Tuple[torch.Tensor]]=None, attn_bias: Optional[torch.Tensor]=None, attention_mask: Optional[torch.ByteTensor]=None, is_causal: bool=True) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor]]]:
35
  a = self.norm_1(x)
36
- (b, _, past_key_value) = self.attn(a, past_key_value=past_key_value, attn_bias=attn_bias, attention_mask=attention_mask, is_causal=is_causal)
37
  x = x + self.resid_attn_dropout(b)
38
- m = self.norm_2(x)
 
 
39
  n = self.ffn(m)
40
  x = x + self.resid_ffn_dropout(n)
41
- return (x, past_key_value)
 
1
  """GPT Blocks used for the GPT Model."""
2
+ from typing import Any, Dict, Optional, Tuple
3
  import torch
4
  import torch.nn as nn
5
  from .attention import ATTN_CLASS_REGISTRY
6
+ from .ffn import FFN_CLASS_REGISTRY, build_ffn
7
  from .norm import NORM_CLASS_REGISTRY
8
 
9
  class MPTBlock(nn.Module):
10
 
11
+ def __init__(self, d_model: int, n_heads: int, expansion_ratio: int, attn_config: Optional[Dict]=None, ffn_config: Optional[Dict]=None, resid_pdrop: float=0.0, norm_type: str='low_precision_layernorm', fc_type: str='torch', device: Optional[str]=None, no_bias: bool=False, **kwargs: Any):
12
+ if attn_config is None:
13
+ attn_config = {'attn_type': 'multihead_attention', 'attn_pdrop': 0.0, 'attn_impl': 'triton', 'qk_ln': False, 'clip_qkv': None, 'softmax_scale': None, 'prefix_lm': False, 'attn_uses_sequence_id': False, 'alibi': False, 'alibi_bias_max': 8}
14
+ if ffn_config is None:
15
+ ffn_config = {'ffn_type': 'mptmlp'}
16
  del kwargs
17
  super().__init__()
18
  norm_class = NORM_CLASS_REGISTRY[norm_type.lower()]
19
+ assert isinstance(attn_config['attn_type'], str)
20
  attn_class = ATTN_CLASS_REGISTRY[attn_config['attn_type']]
21
+ args_to_exclude_in_attn_class = {'attn_type', 'prefix_lm', 'alibi', 'attn_uses_sequence_id', 'alibi_bias_max'}
22
+ attn_config_subset_for_attn_class = {k: v for (k, v) in attn_config.items() if k not in args_to_exclude_in_attn_class}
23
  self.norm_1 = norm_class(d_model, device=device)
24
+ self.attn = attn_class(d_model=d_model, n_heads=n_heads, fc_type=fc_type, device=device, **attn_config_subset_for_attn_class, bias=not no_bias)
25
+ self.norm_2 = None
26
+ if not getattr(FFN_CLASS_REGISTRY[ffn_config['ffn_type']], '_has_norm', False):
27
+ self.norm_2 = norm_class(d_model, device=device)
28
+ self.ffn = build_ffn(d_model=d_model, expansion_ratio=expansion_ratio, device=device, bias=not no_bias, **ffn_config)
29
  self.resid_attn_dropout = nn.Dropout(resid_pdrop)
30
  self.resid_ffn_dropout = nn.Dropout(resid_pdrop)
31
 
32
+ def forward(self, x: torch.Tensor, past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]]=None, attn_bias: Optional[torch.Tensor]=None, attention_mask: Optional[torch.ByteTensor]=None, is_causal: bool=True, output_attentions: bool=False) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor, torch.Tensor]]]:
33
  a = self.norm_1(x)
34
+ (b, attn_weights, past_key_value) = self.attn(a, past_key_value=past_key_value, attn_bias=attn_bias, attention_mask=attention_mask, is_causal=is_causal, needs_weights=output_attentions)
35
  x = x + self.resid_attn_dropout(b)
36
+ m = x
37
+ if self.norm_2 is not None:
38
+ m = self.norm_2(x)
39
  n = self.ffn(m)
40
  x = x + self.resid_ffn_dropout(n)
41
+ return (x, attn_weights, past_key_value)
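
The new `MPTBlock.__init__` no longer passes attention options one by one. It drops the model-level keys from `attn_config` and forwards the rest to the attention class as keyword arguments. The sketch below reproduces that filtering on plain dictionaries. `fill_defaults` and `attn_class_kwargs` are hypothetical names, `fill_defaults` returns a copy rather than mutating its input as `_set_config_defaults` does, and it assumes `kv_n_heads` reaches `GroupedQueryAttention` through `attn_config`, which the diff suggests but does not show explicitly.

```python
from typing import Any, Dict

# Defaults copied from attn_config_defaults in the diff above.
ATTN_CONFIG_DEFAULTS: Dict[str, Any] = {
    'attn_type': 'multihead_attention', 'attn_pdrop': 0.0,
    'attn_impl': 'triton', 'qk_ln': False, 'clip_qkv': None,
    'softmax_scale': None, 'prefix_lm': False,
    'attn_uses_sequence_id': False, 'alibi': False, 'alibi_bias_max': 8,
}

# Keys consumed at the model level, not by the attention class itself.
ARGS_TO_EXCLUDE = {'attn_type', 'prefix_lm', 'alibi',
                   'attn_uses_sequence_id', 'alibi_bias_max'}


def fill_defaults(config: Dict[str, Any], defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` with missing keys taken from `defaults`."""
    merged = dict(defaults)
    merged.update(config)
    return merged


def attn_class_kwargs(attn_config: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the keys that would be forwarded to the attention class."""
    return {k: v for (k, v) in attn_config.items() if k not in ARGS_TO_EXCLUDE}


# Assumed user config: kv_n_heads supplied via attn_config (an assumption).
user_config = {'attn_type': 'grouped_query_attention',
               'attn_impl': 'torch', 'kv_n_heads': 8}
kwargs = attn_class_kwargs(fill_defaults(user_config, ATTN_CONFIG_DEFAULTS))
print(sorted(kwargs))
# ['attn_impl', 'attn_pdrop', 'clip_qkv', 'kv_n_heads', 'qk_ln', 'softmax_scale']
```

One consequence of this design is that any extra key placed in `attn_config` is passed straight through, so a key the chosen attention class does not accept would surface as a `TypeError` at construction time.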
configuration_mpt.py CHANGED
@@ -1,27 +1,29 @@
1
  """A HuggingFace-style model configuration."""
2
- from typing import Dict, Optional, Union
 
3
  from transformers import PretrainedConfig
4
  attn_config_defaults: Dict = {'attn_type': 'multihead_attention', 'attn_pdrop': 0.0, 'attn_impl': 'triton', 'qk_ln': False, 'clip_qkv': None, 'softmax_scale': None, 'prefix_lm': False, 'attn_uses_sequence_id': False, 'alibi': False, 'alibi_bias_max': 8}
5
- init_config_defaults: Dict = {'name': 'kaiming_normal_', 'fan_mode': 'fan_in', 'init_nonlinearity': 'relu'}
 
6
 
7
  class MPTConfig(PretrainedConfig):
8
  model_type = 'mpt'
9
 
10
- def __init__(self, d_model: int=2048, n_heads: int=16, n_layers: int=24, expansion_ratio: int=4, max_seq_len: int=2048, vocab_size: int=50368, resid_pdrop: float=0.0, emb_pdrop: float=0.0, learned_pos_emb: bool=True, attn_config: Dict=attn_config_defaults, init_device: str='cpu', logit_scale: Optional[Union[float, str]]=None, no_bias: bool=False, verbose: int=0, embedding_fraction: float=1.0, norm_type: str='low_precision_layernorm', use_cache: bool=False, init_config: Dict=init_config_defaults, **kwargs):
11
  """The MPT configuration class.
12
 
13
  Args:
14
  d_model (int): The size of the embedding dimension of the model.
15
  n_heads (int): The number of attention heads.
16
  n_layers (int): The number of layers in the model.
17
- expansion_ratio (int): The ratio of the up/down scale in the MLP.
18
  max_seq_len (int): The maximum sequence length of the model.
19
  vocab_size (int): The size of the vocabulary.
20
  resid_pdrop (float): The dropout probability applied to the attention output before combining with residual.
21
  emb_pdrop (float): The dropout probability for the embedding layer.
22
  learned_pos_emb (bool): Whether to use learned positional embeddings
23
- attn_config (Dict): A dictionary used to configure the model's attention module:
24
- attn_type (str): type of attention to use. Options: multihead_attention, multiquery_attention
25
  attn_pdrop (float): The dropout probability for the attention layers.
26
  attn_impl (str): The attention implementation to use. One of 'torch', 'flash', or 'triton'.
27
  qk_ln (bool): Whether to apply layer normalization to the queries and keys in the attention layer.
@@ -38,13 +40,15 @@ class MPTConfig(PretrainedConfig):
38
  Defaults to ``False`` meaning any provided `sequence_id` will be ignored.
39
  alibi (bool): Whether to use the alibi bias instead of position embeddings.
40
  alibi_bias_max (int): The maximum value of the alibi bias.
 
 
 
41
  init_device (str): The device to use for parameter initialization.
42
  logit_scale (Optional[Union[float, str]]): If not None, scale the logits by this value.
43
  no_bias (bool): Whether to disable bias in all layers.
44
  verbose (int): The verbosity level. 0 is silent.
45
  embedding_fraction (float): The fraction to scale the gradients of the embedding layer by.
46
  norm_type (str): choose type of norm to use
47
- multiquery_attention (bool): Whether to use multiquery attention implementation.
48
  use_cache (bool): Whether or not the model should return the last key/values attentions
49
  init_config (Dict): A dictionary used to configure the model initialization:
50
  init_config.name: The parameter initialization scheme to use. Options: 'default_', 'baseline_',
@@ -61,6 +65,7 @@ class MPTConfig(PretrainedConfig):
61
  init_nonlinearity (str): The nonlinearity to use for parameter initialization with kaiming initialization schemes.
62
  ---
63
  See llmfoundry.models.utils.param_init_fns.py for info on other param init config options
 
64
  """
65
  self.d_model = d_model
66
  self.n_heads = n_heads
@@ -72,29 +77,36 @@ class MPTConfig(PretrainedConfig):
72
  self.emb_pdrop = emb_pdrop
73
  self.learned_pos_emb = learned_pos_emb
74
  self.attn_config = attn_config
 
75
  self.init_device = init_device
76
  self.logit_scale = logit_scale
77
  self.no_bias = no_bias
78
- self.verbose = verbose
79
  self.embedding_fraction = embedding_fraction
80
  self.norm_type = norm_type
81
  self.use_cache = use_cache
82
  self.init_config = init_config
 
 
 
83
  if 'name' in kwargs:
84
  del kwargs['name']
85
  if 'loss_fn' in kwargs:
86
  del kwargs['loss_fn']
 
 
 
87
  super().__init__(**kwargs)
88
  self._validate_config()
89
 
90
- def _set_config_defaults(self, config, config_defaults):
91
  for (k, v) in config_defaults.items():
92
  if k not in config:
93
  config[k] = v
94
  return config
95
 
96
- def _validate_config(self):
97
  self.attn_config = self._set_config_defaults(self.attn_config, attn_config_defaults)
 
98
  self.init_config = self._set_config_defaults(self.init_config, init_config_defaults)
99
  if self.d_model % self.n_heads != 0:
100
  raise ValueError('d_model must be divisible by n_heads')
@@ -115,4 +127,14 @@ class MPTConfig(PretrainedConfig):
115
  if self.init_config.get('name', None) is None:
116
  raise ValueError(f"self.init_config={self.init_config!r} 'name' needs to be set.")
117
  if not self.learned_pos_emb and (not self.attn_config['alibi']):
118
- raise ValueError(f'Positional information must be provided to the model using either learned_pos_emb or alibi.')
1
  """A HuggingFace-style model configuration."""
2
+ import warnings
3
+ from typing import Any, Dict, Optional, Union
4
  from transformers import PretrainedConfig
5
  attn_config_defaults: Dict = {'attn_type': 'multihead_attention', 'attn_pdrop': 0.0, 'attn_impl': 'triton', 'qk_ln': False, 'clip_qkv': None, 'softmax_scale': None, 'prefix_lm': False, 'attn_uses_sequence_id': False, 'alibi': False, 'alibi_bias_max': 8}
6
+ ffn_config_defaults: Dict = {'ffn_type': 'mptmlp'}
7
+ init_config_defaults: Dict = {'name': 'kaiming_normal_', 'fan_mode': 'fan_in', 'init_nonlinearity': 'relu', 'init_div_is_residual': True, 'emb_init_std': None, 'emb_init_uniform_lim': None, 'init_std': None, 'init_gain': 0.0}
8
 
9
  class MPTConfig(PretrainedConfig):
10
  model_type = 'mpt'
11
 
12
+ def __init__(self, d_model: int=2048, n_heads: int=16, n_layers: int=24, expansion_ratio: int=4, max_seq_len: int=2048, vocab_size: int=50368, resid_pdrop: float=0.0, emb_pdrop: float=0.0, learned_pos_emb: bool=True, attn_config: Dict=attn_config_defaults, ffn_config: Dict=ffn_config_defaults, init_device: str='cpu', logit_scale: Optional[Union[float, str]]=None, no_bias: bool=False, embedding_fraction: float=1.0, norm_type: str='low_precision_layernorm', use_cache: bool=False, init_config: Dict=init_config_defaults, fc_type: str='torch', verbose: Optional[int]=None, **kwargs: Any):
13
  """The MPT configuration class.
14
 
15
  Args:
16
  d_model (int): The size of the embedding dimension of the model.
17
  n_heads (int): The number of attention heads.
18
  n_layers (int): The number of layers in the model.
19
+ expansion_ratio (int): The ratio of the up/down scale in the ffn.
20
  max_seq_len (int): The maximum sequence length of the model.
21
  vocab_size (int): The size of the vocabulary.
22
  resid_pdrop (float): The dropout probability applied to the attention output before combining with residual.
23
  emb_pdrop (float): The dropout probability for the embedding layer.
24
  learned_pos_emb (bool): Whether to use learned positional embeddings
25
+ attn_config (Dict): A dictionary used to configure the model's attention module:
26
+ attn_type (str): type of attention to use. Options: multihead_attention, multiquery_attention, grouped_query_attention
27
  attn_pdrop (float): The dropout probability for the attention layers.
28
  attn_impl (str): The attention implementation to use. One of 'torch', 'flash', or 'triton'.
29
  qk_ln (bool): Whether to apply layer normalization to the queries and keys in the attention layer.
 
40
  Defaults to ``False`` meaning any provided `sequence_id` will be ignored.
41
  alibi (bool): Whether to use the alibi bias instead of position embeddings.
42
  alibi_bias_max (int): The maximum value of the alibi bias.
43
+ kv_n_heads (Optional[int]): For grouped_query_attention only, allow user to specify number of kv heads.
44
+ ffn_config (Dict): A dictionary used to configure the model's ffn module:
45
+ ffn_type (str): type of ffn to use. Options: mptmlp, te_ln_mlp
46
  init_device (str): The device to use for parameter initialization.
47
  logit_scale (Optional[Union[float, str]]): If not None, scale the logits by this value.
48
  no_bias (bool): Whether to use bias in all layers.
49
  verbose (int): The verbosity level. 0 is silent.
50
  embedding_fraction (float): The fraction to scale the gradients of the embedding layer by.
51
  norm_type (str): choose type of norm to use
 
52
  use_cache (bool): Whether or not the model should return the last key/values attentions
53
  init_config (Dict): A dictionary used to configure the model initialization:
54
  init_config.name: The parameter initialization scheme to use. Options: 'default_', 'baseline_',
 
65
  init_nonlinearity (str): The nonlinearity to use for parameter initialization with kaiming initialization schemes.
66
  ---
67
  See llmfoundry.models.utils.param_init_fns.py for info on other param init config options
68
+ fc_type (str): choose fc layer implementation. Options: torch and te. te layers support fp8 when using H100 GPUs.
69
  """
70
  self.d_model = d_model
71
  self.n_heads = n_heads
 
77
  self.emb_pdrop = emb_pdrop
78
  self.learned_pos_emb = learned_pos_emb
79
  self.attn_config = attn_config
80
+ self.ffn_config = ffn_config
81
  self.init_device = init_device
82
  self.logit_scale = logit_scale
83
  self.no_bias = no_bias
 
84
  self.embedding_fraction = embedding_fraction
85
  self.norm_type = norm_type
86
  self.use_cache = use_cache
87
  self.init_config = init_config
88
+ self.fc_type = fc_type
89
+ if verbose is not None:
90
+ warnings.warn(DeprecationWarning('verbose argument for MPTConfig is now ignored and will be removed. Use python_log_level instead.'))
91
  if 'name' in kwargs:
92
  del kwargs['name']
93
  if 'loss_fn' in kwargs:
94
  del kwargs['loss_fn']
95
+ if self.attn_config.get('alibi', False):
96
+ self.learned_pos_emb = False
97
+ warnings.warn(f'alibi is turned on, setting `learned_pos_emb` to `False.`')
98
  super().__init__(**kwargs)
99
  self._validate_config()
100
 
101
+ def _set_config_defaults(self, config: Dict[str, Any], config_defaults: Dict[str, Any]) -> Dict[str, Any]:
102
  for (k, v) in config_defaults.items():
103
  if k not in config:
104
  config[k] = v
105
  return config
106
 
107
+ def _validate_config(self) -> None:
108
  self.attn_config = self._set_config_defaults(self.attn_config, attn_config_defaults)
109
+ self.ffn_config = self._set_config_defaults(self.ffn_config, ffn_config_defaults)
110
  self.init_config = self._set_config_defaults(self.init_config, init_config_defaults)
111
  if self.d_model % self.n_heads != 0:
112
  raise ValueError('d_model must be divisible by n_heads')
 
127
  if self.init_config.get('name', None) is None:
128
  raise ValueError(f"self.init_config={self.init_config!r} 'name' needs to be set.")
129
  if not self.learned_pos_emb and (not self.attn_config['alibi']):
130
+ warnings.warn(f'Positional information not being provided to the model using either learned_pos_emb or alibi.')
131
+ if self.fc_type == 'te' or self.ffn_config['ffn_type'] == 'te_ln_mlp':
132
+ try:
133
+ import transformer_engine.pytorch as te
134
+ del te
135
+ except:
136
+ raise ImportError('TransformerEngine import fail. `fc_type: te` requires TransformerEngine be installed. ' + 'The required version of transformer_engine also requires FlashAttention v1.0.6 is installed:\n' + 'pip install flash-attn==1.0.6 --no-build-isolation \n' + 'pip install git+https://github.com/NVIDIA/TransformerEngine.git@144e4888b2cdd60bd52e706d5b7a79cb9c1a7156')
137
+ if self.ffn_config['ffn_type'] == 'mptmlp':
138
+ self.ffn_config['fc_type'] = self.fc_type
139
+ elif self.ffn_config['ffn_type'] == 'te_ln_mlp':
140
+ self.ffn_config['bias'] = not self.no_bias
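Note on the configuration changes above: the following is a minimal sketch, not the actual `MPTConfig` code, of how partial user dictionaries appear to be merged with the defaults and how enabling `alibi` now switches `learned_pos_emb` off. `ATTN_DEFAULTS` is a trimmed, hypothetical copy of `attn_config_defaults`, and `set_config_defaults` and `resolve_pos_emb` are illustrative names that do not exist in the repository.

```python
from typing import Any, Dict

# Hypothetical, trimmed-down copy of attn_config_defaults from the diff.
ATTN_DEFAULTS: Dict[str, Any] = {
    'attn_type': 'multihead_attention',
    'attn_impl': 'triton',
    'alibi': False,
    'alibi_bias_max': 8,
}


def set_config_defaults(config: Dict[str, Any], defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in missing keys in place, mirroring _set_config_defaults."""
    for k, v in defaults.items():
        if k not in config:
            config[k] = v
    return config


def resolve_pos_emb(attn_config: Dict[str, Any], learned_pos_emb: bool = True) -> bool:
    """Mirror the new rule: when alibi is on, learned_pos_emb is forced to False."""
    if attn_config.get('alibi', False):
        return False
    return learned_pos_emb


# A user supplies only the keys they care about; the rest come from the defaults.
user_attn = {'alibi': True, 'attn_impl': 'torch'}
merged = set_config_defaults(user_attn, ATTN_DEFAULTS)
print(merged)
print(resolve_pos_emb(merged))
```

The merge mutates and returns the same dictionary it was given, so user-supplied keys always win over defaults.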
custom_embedding.py ADDED
@@ -0,0 +1,10 @@
1
+ import torch.nn as nn
2
+ import torch.nn.functional as F
3
+ from torch import Tensor
4
+
5
+ class SharedEmbedding(nn.Embedding):
6
+
7
+ def forward(self, input: Tensor, unembed: bool=False) -> Tensor:
8
+ if unembed:
9
+ return F.linear(input, self.weight)
10
+ return super().forward(input)
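Note on `SharedEmbedding`: the class reuses a single weight matrix for both the token lookup and the output projection (`F.linear(input, self.weight)` computes `input @ weight.T`). The sketch below illustrates that idea in plain Python without torch; `TinySharedEmbedding` is a hypothetical stand-in written for this note and is not part of the commit.

```python
from typing import List, Union


class TinySharedEmbedding:
    """Plain-Python illustration of a tied embedding / unembedding matrix."""

    def __init__(self, weight: List[List[float]]):
        # weight has vocab_size rows and d_model columns.
        self.weight = weight

    def forward(self, x: Union[List[int], List[float]], unembed: bool = False):
        if unembed:
            # Hidden vector (d_model) -> logits (vocab_size): x @ weight.T
            return [sum(h * w for h, w in zip(x, row)) for row in self.weight]
        # Token ids -> embedding rows.
        return [self.weight[i] for i in x]


emb = TinySharedEmbedding([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(emb.forward([2, 0]))                     # embedding lookup
print(emb.forward([0.5, 2.0], unembed=True))   # one logit per vocabulary entry
```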
fc.py ADDED
@@ -0,0 +1,7 @@
1
+ from torch import nn
2
+ FC_CLASS_REGISTRY = {'torch': nn.Linear}
3
+ try:
4
+ import transformer_engine.pytorch as te
5
+ FC_CLASS_REGISTRY['te'] = te.Linear
6
+ except:
7
+ pass
ffn.py ADDED
@@ -0,0 +1,39 @@
1
+ """GPT Blocks used for the GPT Model."""
2
+ from typing import Any, Optional
3
+ import torch
4
+ import torch.nn as nn
5
+ from .fc import FC_CLASS_REGISTRY
6
+ try:
7
+ import transformer_engine.pytorch as te
8
+ except:
9
+ te = None
10
+
11
+ class MPTMLP(nn.Module):
12
+
13
+ def __init__(self, d_model: int, expansion_ratio: int, fc_type: str='torch', device: Optional[str]=None, bias: bool=True):
14
+ super().__init__()
15
+ fc_kwargs: dict[str, Any] = {'bias': bias}
16
+ if fc_type != 'te':
17
+ fc_kwargs['device'] = device
18
+ self.up_proj = FC_CLASS_REGISTRY[fc_type](d_model, expansion_ratio * d_model, **fc_kwargs)
19
+ self.act = nn.GELU(approximate='none')
20
+ self.down_proj = FC_CLASS_REGISTRY[fc_type](expansion_ratio * d_model, d_model, **fc_kwargs)
21
+ self.down_proj._is_residual = True
22
+
23
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
24
+ return self.down_proj(self.act(self.up_proj(x)))
25
+ FFN_CLASS_REGISTRY = {'mptmlp': MPTMLP}
26
+ if te is not None:
27
+ te.LayerNormMLP._has_norm = True
28
+ FFN_CLASS_REGISTRY['te_ln_mlp'] = te.LayerNormMLP
29
+
30
+ def build_ffn(d_model: int, expansion_ratio: int, fc_type: str='torch', device: Optional[str]=None, bias: bool=True, **kwargs: Any) -> nn.Module:
31
+ ffn_type = kwargs.pop('ffn_type')
32
+ if ffn_type == 'mptmlp':
33
+ if len(kwargs) > 0:
34
+ raise ValueError(f'MPTMLP got an unexpected keyword argument: {kwargs}')
35
+ return MPTMLP(d_model=d_model, expansion_ratio=expansion_ratio, fc_type=fc_type, device=device, bias=bias)
36
+ elif ffn_type == 'te_ln_mlp':
37
+ assert te is not None
38
+ return te.LayerNormMLP(hidden_size=d_model, ffn_hidden_size=d_model * expansion_ratio, bias=bias, **kwargs)
39
+ raise ValueError(f'ffn_type={ffn_type!r} not recognized.')
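Note on `MPTMLP`: the block projects `d_model` up to `expansion_ratio * d_model` and back down again. As a rough worked example of what that implies for size, the sketch below counts the parameters of the two linear layers. `mptmlp_param_count` is a hypothetical helper written for this note, and it assumes plain `nn.Linear`-style layers (one weight matrix plus an optional bias vector each).

```python
def mptmlp_param_count(d_model: int, expansion_ratio: int, bias: bool = True) -> int:
    """Count weights (and optional biases) in up_proj and down_proj."""
    hidden = expansion_ratio * d_model
    up = d_model * hidden + (hidden if bias else 0)     # up_proj: d_model -> hidden
    down = hidden * d_model + (d_model if bias else 0)  # down_proj: hidden -> d_model
    return up + down


# Using the MPTConfig signature defaults from the diff (d_model=2048, expansion_ratio=4).
print(mptmlp_param_count(2048, 4))              # with biases
print(mptmlp_param_count(2048, 4, bias=False))  # no_bias=True in the config
```

With these defaults each FFN block holds about 33.6M parameters, nearly all of them in the two weight matrices.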
flash_attn_triton.py ADDED
@@ -0,0 +1,484 @@
1
+ """
2
+ Copied from https://github.com/HazyResearch/flash-attention/blob/eff9fe6b8076df59d64d7a3f464696738a3c7c24/flash_attn/flash_attn_triton.py
3
+ update imports to use 'triton_pre_mlir'
4
+
5
+ *Experimental* implementation of FlashAttention in Triton.
6
+ Tested with triton==2.0.0.dev20221202.
7
+ Triton 2.0 has a new backend (MLIR) but seems like it doesn't yet work for head dimensions
8
+ other than 64:
9
+ https://github.com/openai/triton/blob/d376020f90002757eea3ea9475d4f7cfc2ec5ead/python/triton/ops/flash_attention.py#L207
10
+ We'll update this implementation with the new Triton backend once this is fixed.
11
+
12
+ We use the FlashAttention implementation from Phil Tillet as a starting point.
13
+ https://github.com/openai/triton/blob/master/python/tutorials/06-fused-attention.py
14
+
15
+ Changes:
16
+ - Implement both causal and non-causal attention.
17
+ - Implement both self-attention and cross-attention.
18
+ - Support arbitrary seqlens (not just multiples of 128), for both forward and backward.
19
+ - Support all head dimensions up to 128 (not just 16, 32, 64, 128), for both forward and backward.
20
+ - Support attention bias.
21
+ - Speed up the forward pass a bit, and only store the LSE instead of m and l.
22
+ - Make the backward for d=128 much faster by reducing register spilling.
23
+ - Optionally parallelize the backward pass across seqlen_k, to deal with the case of
24
+ small batch size * nheads.
25
+
26
+ Caution:
27
+ - This is an *experimental* implementation. The forward pass should be quite robust but
28
+ I'm not 100% sure that the backward pass doesn't have race conditions (due to the Triton compiler).
29
+ - This implementation has only been tested on A100.
30
+ - If you plan to use headdim other than 64 and 128, you should test for race conditions
31
+ (due to the Triton compiler), as done in tests/test_flash_attn.py
32
+ "test_flash_attn_triton_race_condition". I've tested and fixed many race conditions
33
+ for different head dimensions (40, 48, 64, 128, 80, 88, 96), but I'm still not 100% confident
34
+ that there are none left for other head dimensions.
35
+
36
+ Differences between this Triton version and the CUDA version:
37
+ - Triton version doesn't support dropout.
38
+ - Triton forward is generally faster than CUDA forward, while Triton backward is
39
+ generally slower than CUDA backward. Overall Triton forward + backward is slightly slower
40
+ than CUDA forward + backward.
41
+ - Triton version doesn't support different sequence lengths in a batch (i.e., RaggedTensor/NestedTensor).
42
+ - Triton version supports attention bias, while CUDA version doesn't.
43
+ """
44
+ import math
45
+ import torch
46
+ import triton_pre_mlir as triton
47
+ import triton_pre_mlir.language as tl
48
+
49
+ @triton.heuristics({'EVEN_M': lambda args: args['seqlen_q'] % args['BLOCK_M'] == 0, 'EVEN_N': lambda args: args['seqlen_k'] % args['BLOCK_N'] == 0, 'EVEN_HEADDIM': lambda args: args['headdim'] == args['BLOCK_HEADDIM']})
50
+ @triton.jit
51
+ def _fwd_kernel(Q, K, V, Bias, Out, Lse, TMP, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_ob, stride_oh, stride_om, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE: tl.constexpr, IS_CAUSAL: tl.constexpr, BLOCK_HEADDIM: tl.constexpr, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
52
+ start_m = tl.program_id(0)
53
+ off_hb = tl.program_id(1)
54
+ off_b = off_hb // nheads
55
+ off_h = off_hb % nheads
56
+ offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
57
+ offs_n = tl.arange(0, BLOCK_N)
58
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
59
+ q_ptrs = Q + off_b * stride_qb + off_h * stride_qh + (offs_m[:, None] * stride_qm + offs_d[None, :])
60
+ k_ptrs = K + off_b * stride_kb + off_h * stride_kh + (offs_n[:, None] * stride_kn + offs_d[None, :])
61
+ v_ptrs = V + off_b * stride_vb + off_h * stride_vh + (offs_n[:, None] * stride_vn + offs_d[None, :])
62
+ if BIAS_TYPE == 'vector':
63
+ b_ptrs = Bias + off_b * stride_bb + off_h * stride_bh + offs_n
64
+ elif BIAS_TYPE == 'matrix':
65
+ b_ptrs = Bias + off_b * stride_bb + off_h * stride_bh + (offs_m[:, None] * stride_bm + offs_n[None, :])
66
+ t_ptrs = TMP + off_hb * seqlen_q_rounded + offs_m
67
+ lse_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float('inf')
68
+ m_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float('inf')
69
+ acc_o = tl.zeros([BLOCK_M, BLOCK_HEADDIM], dtype=tl.float32)
70
+ if EVEN_M & EVEN_N:
71
+ if EVEN_HEADDIM:
72
+ q = tl.load(q_ptrs)
73
+ else:
74
+ q = tl.load(q_ptrs, mask=offs_d[None, :] < headdim, other=0.0)
75
+ elif EVEN_HEADDIM:
76
+ q = tl.load(q_ptrs, mask=offs_m[:, None] < seqlen_q, other=0.0)
77
+ else:
78
+ q = tl.load(q_ptrs, mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0)
79
+ end_n = seqlen_k if not IS_CAUSAL else tl.minimum((start_m + 1) * BLOCK_M, seqlen_k)
80
+ for start_n in range(0, end_n, BLOCK_N):
81
+ start_n = tl.multiple_of(start_n, BLOCK_N)
82
+ if EVEN_N & EVEN_M:
83
+ if EVEN_HEADDIM:
84
+ k = tl.load(k_ptrs + start_n * stride_kn)
85
+ else:
86
+ k = tl.load(k_ptrs + start_n * stride_kn, mask=offs_d[None, :] < headdim, other=0.0)
87
+ elif EVEN_HEADDIM:
88
+ k = tl.load(k_ptrs + start_n * stride_kn, mask=(start_n + offs_n)[:, None] < seqlen_k, other=0.0)
89
+ else:
90
+ k = tl.load(k_ptrs + start_n * stride_kn, mask=((start_n + offs_n)[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
91
+ qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
92
+ qk += tl.dot(q, k, trans_b=True)
93
+ if not EVEN_N:
94
+ qk += tl.where((start_n + offs_n)[None, :] < seqlen_k, 0, float('-inf'))
95
+ if IS_CAUSAL:
96
+ qk += tl.where(offs_m[:, None] >= (start_n + offs_n)[None, :], 0, float('-inf'))
97
+ if BIAS_TYPE != 'none':
98
+ if BIAS_TYPE == 'vector':
99
+ if EVEN_N:
100
+ bias = tl.load(b_ptrs + start_n).to(tl.float32)
101
+ else:
102
+ bias = tl.load(b_ptrs + start_n, mask=start_n + offs_n < seqlen_k, other=0.0).to(tl.float32)
103
+ bias = bias[None, :]
104
+ elif BIAS_TYPE == 'matrix':
105
+ if EVEN_M & EVEN_N:
106
+ bias = tl.load(b_ptrs + start_n).to(tl.float32)
107
+ else:
108
+ bias = tl.load(b_ptrs + start_n, mask=(offs_m[:, None] < seqlen_q) & ((start_n + offs_n)[None, :] < seqlen_k), other=0.0).to(tl.float32)
109
+ qk = qk * softmax_scale + bias
110
+ m_ij = tl.maximum(tl.max(qk, 1), lse_i)
111
+ p = tl.exp(qk - m_ij[:, None])
112
+ else:
113
+ m_ij = tl.maximum(tl.max(qk, 1) * softmax_scale, lse_i)
114
+ p = tl.exp(qk * softmax_scale - m_ij[:, None])
115
+ l_ij = tl.sum(p, 1)
116
+ acc_o_scale = tl.exp(m_i - m_ij)
117
+ tl.store(t_ptrs, acc_o_scale)
118
+ acc_o_scale = tl.load(t_ptrs)
119
+ acc_o = acc_o * acc_o_scale[:, None]
120
+ if EVEN_N & EVEN_M:
121
+ if EVEN_HEADDIM:
122
+ v = tl.load(v_ptrs + start_n * stride_vn)
123
+ else:
124
+ v = tl.load(v_ptrs + start_n * stride_vn, mask=offs_d[None, :] < headdim, other=0.0)
125
+ elif EVEN_HEADDIM:
126
+ v = tl.load(v_ptrs + start_n * stride_vn, mask=(start_n + offs_n)[:, None] < seqlen_k, other=0.0)
127
+ else:
128
+ v = tl.load(v_ptrs + start_n * stride_vn, mask=((start_n + offs_n)[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
129
+ p = p.to(v.dtype)
130
+ acc_o += tl.dot(p, v)
131
+ m_i = m_ij
132
+ l_i_new = tl.exp(lse_i - m_ij) + l_ij
133
+ lse_i = m_ij + tl.log(l_i_new)
134
+ o_scale = tl.exp(m_i - lse_i)
135
+ tl.store(t_ptrs, o_scale)
136
+ o_scale = tl.load(t_ptrs)
137
+ acc_o = acc_o * o_scale[:, None]
138
+ start_m = tl.program_id(0)
139
+ offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
140
+ lse_ptrs = Lse + off_hb * seqlen_q_rounded + offs_m
141
+ tl.store(lse_ptrs, lse_i)
142
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
143
+ out_ptrs = Out + off_b * stride_ob + off_h * stride_oh + (offs_m[:, None] * stride_om + offs_d[None, :])
144
+ if EVEN_M:
145
+ if EVEN_HEADDIM:
146
+ tl.store(out_ptrs, acc_o)
147
+ else:
148
+ tl.store(out_ptrs, acc_o, mask=offs_d[None, :] < headdim)
149
+ elif EVEN_HEADDIM:
150
+ tl.store(out_ptrs, acc_o, mask=offs_m[:, None] < seqlen_q)
151
+ else:
152
+ tl.store(out_ptrs, acc_o, mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim))
153
+
154
+ @triton.jit
155
+ def _bwd_preprocess_do_o_dot(Out, DO, Delta, stride_ob, stride_oh, stride_om, stride_dob, stride_doh, stride_dom, nheads, seqlen_q, seqlen_q_rounded, headdim, BLOCK_M: tl.constexpr, BLOCK_HEADDIM: tl.constexpr):
156
+ start_m = tl.program_id(0)
157
+ off_hb = tl.program_id(1)
158
+ off_b = off_hb // nheads
159
+ off_h = off_hb % nheads
160
+ offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
161
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
162
+ o = tl.load(Out + off_b * stride_ob + off_h * stride_oh + offs_m[:, None] * stride_om + offs_d[None, :], mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0).to(tl.float32)
163
+ do = tl.load(DO + off_b * stride_dob + off_h * stride_doh + offs_m[:, None] * stride_dom + offs_d[None, :], mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0).to(tl.float32)
164
+ delta = tl.sum(o * do, axis=1)
165
+ tl.store(Delta + off_hb * seqlen_q_rounded + offs_m, delta)
166
+
167
+ @triton.jit
168
+ def _bwd_store_dk_dv(dk_ptrs, dv_ptrs, dk, dv, offs_n, offs_d, seqlen_k, headdim, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr):
169
+ if EVEN_N & EVEN_M:
170
+ if EVEN_HEADDIM:
171
+ tl.store(dv_ptrs, dv)
172
+ tl.store(dk_ptrs, dk)
173
+ else:
174
+ tl.store(dv_ptrs, dv, mask=offs_d[None, :] < headdim)
175
+ tl.store(dk_ptrs, dk, mask=offs_d[None, :] < headdim)
176
+ elif EVEN_HEADDIM:
177
+ tl.store(dv_ptrs, dv, mask=offs_n[:, None] < seqlen_k)
178
+ tl.store(dk_ptrs, dk, mask=offs_n[:, None] < seqlen_k)
179
+ else:
180
+ tl.store(dv_ptrs, dv, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim))
181
+ tl.store(dk_ptrs, dk, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim))
182
+
183
+ @triton.jit
184
+ def _bwd_kernel_one_col_block(start_n, Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qm, stride_kn, stride_vn, stride_bm, stride_dom, stride_dqm, stride_dkn, stride_dvn, seqlen_q, seqlen_k, headdim, ATOMIC_ADD: tl.constexpr, BIAS_TYPE: tl.constexpr, IS_CAUSAL: tl.constexpr, BLOCK_HEADDIM: tl.constexpr, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
185
+ begin_m = 0 if not IS_CAUSAL else start_n * BLOCK_N // BLOCK_M * BLOCK_M
186
+ offs_qm = begin_m + tl.arange(0, BLOCK_M)
187
+ offs_n = start_n * BLOCK_N + tl.arange(0, BLOCK_N)
188
+ offs_m = tl.arange(0, BLOCK_M)
189
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
190
+ q_ptrs = Q + (offs_qm[:, None] * stride_qm + offs_d[None, :])
191
+ k_ptrs = K + (offs_n[:, None] * stride_kn + offs_d[None, :])
192
+ v_ptrs = V + (offs_n[:, None] * stride_vn + offs_d[None, :])
193
+ do_ptrs = DO + (offs_qm[:, None] * stride_dom + offs_d[None, :])
194
+ dq_ptrs = DQ + (offs_qm[:, None] * stride_dqm + offs_d[None, :])
195
+ if BIAS_TYPE == 'vector':
196
+ b_ptrs = Bias + offs_n
197
+ elif BIAS_TYPE == 'matrix':
198
+ b_ptrs = Bias + (offs_qm[:, None] * stride_bm + offs_n[None, :])
199
+ dv = tl.zeros([BLOCK_N, BLOCK_HEADDIM], dtype=tl.float32)
200
+ dk = tl.zeros([BLOCK_N, BLOCK_HEADDIM], dtype=tl.float32)
201
+ if begin_m >= seqlen_q:
202
+ dv_ptrs = DV + (offs_n[:, None] * stride_dvn + offs_d[None, :])
203
+ dk_ptrs = DK + (offs_n[:, None] * stride_dkn + offs_d[None, :])
204
+ _bwd_store_dk_dv(dk_ptrs, dv_ptrs, dk, dv, offs_n, offs_d, seqlen_k, headdim, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM)
205
+ return
206
+ if EVEN_N & EVEN_M:
207
+ if EVEN_HEADDIM:
208
+ k = tl.load(k_ptrs)
209
+ v = tl.load(v_ptrs)
210
+ else:
211
+ k = tl.load(k_ptrs, mask=offs_d[None, :] < headdim, other=0.0)
212
+ v = tl.load(v_ptrs, mask=offs_d[None, :] < headdim, other=0.0)
213
+ elif EVEN_HEADDIM:
214
+ k = tl.load(k_ptrs, mask=offs_n[:, None] < seqlen_k, other=0.0)
215
+ v = tl.load(v_ptrs, mask=offs_n[:, None] < seqlen_k, other=0.0)
216
+ else:
217
+ k = tl.load(k_ptrs, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
218
+ v = tl.load(v_ptrs, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
219
+ num_block_m = tl.cdiv(seqlen_q, BLOCK_M)
220
+ for start_m in range(begin_m, num_block_m * BLOCK_M, BLOCK_M):
221
+ start_m = tl.multiple_of(start_m, BLOCK_M)
222
+ offs_m_curr = start_m + offs_m
223
+ if EVEN_M & EVEN_HEADDIM:
224
+ q = tl.load(q_ptrs)
225
+ elif EVEN_HEADDIM:
226
+ q = tl.load(q_ptrs, mask=offs_m_curr[:, None] < seqlen_q, other=0.0)
227
+ else:
228
+ q = tl.load(q_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0)
229
+ qk = tl.dot(q, k, trans_b=True)
230
+ if not EVEN_N:
231
+ qk = tl.where(offs_n[None, :] < seqlen_k, qk, float('-inf'))
232
+ if IS_CAUSAL:
233
+ qk = tl.where(offs_m_curr[:, None] >= offs_n[None, :], qk, float('-inf'))
234
+ if BIAS_TYPE != 'none':
235
+ tl.debug_barrier()
236
+ if BIAS_TYPE == 'vector':
237
+ if EVEN_N:
238
+ bias = tl.load(b_ptrs).to(tl.float32)
239
+ else:
240
+ bias = tl.load(b_ptrs, mask=offs_n < seqlen_k, other=0.0).to(tl.float32)
241
+ bias = bias[None, :]
242
+ elif BIAS_TYPE == 'matrix':
243
+ if EVEN_M & EVEN_N:
244
+ bias = tl.load(b_ptrs).to(tl.float32)
245
+ else:
246
+ bias = tl.load(b_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_n[None, :] < seqlen_k), other=0.0).to(tl.float32)
247
+ qk = qk * softmax_scale + bias
248
+ if not EVEN_M & EVEN_HEADDIM:
249
+ tl.debug_barrier()
250
+ lse_i = tl.load(LSE + offs_m_curr)
251
+ if BIAS_TYPE == 'none':
252
+ p = tl.exp(qk * softmax_scale - lse_i[:, None])
253
+ else:
254
+ p = tl.exp(qk - lse_i[:, None])
255
+ if EVEN_M & EVEN_HEADDIM:
256
+ do = tl.load(do_ptrs)
257
+ else:
258
+ do = tl.load(do_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0)
259
+ dv += tl.dot(p.to(do.dtype), do, trans_a=True)
260
+ if not EVEN_M & EVEN_HEADDIM:
261
+ tl.debug_barrier()
262
+ dp = tl.dot(do, v, trans_b=True)
263
+ if not EVEN_HEADDIM:
264
+ tl.debug_barrier()
265
+ Di = tl.load(D + offs_m_curr)
266
+ ds = (p * (dp - Di[:, None]) * softmax_scale).to(q.dtype)
267
+ dk += tl.dot(ds, q, trans_a=True)
268
+ if not EVEN_M & EVEN_HEADDIM:
269
+ tl.debug_barrier()
270
+ if not ATOMIC_ADD:
271
+ if EVEN_M & EVEN_HEADDIM:
272
+ dq = tl.load(dq_ptrs, eviction_policy='evict_last')
273
+ dq += tl.dot(ds, k)
274
+ tl.store(dq_ptrs, dq, eviction_policy='evict_last')
275
+ elif EVEN_HEADDIM:
276
+ dq = tl.load(dq_ptrs, mask=offs_m_curr[:, None] < seqlen_q, other=0.0, eviction_policy='evict_last')
277
+ dq += tl.dot(ds, k)
278
+ tl.store(dq_ptrs, dq, mask=offs_m_curr[:, None] < seqlen_q, eviction_policy='evict_last')
279
+ else:
280
+ dq = tl.load(dq_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0, eviction_policy='evict_last')
281
+ dq += tl.dot(ds, k)
282
+ tl.store(dq_ptrs, dq, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), eviction_policy='evict_last')
283
+ else:
284
+ dq = tl.dot(ds, k)
285
+ if EVEN_M & EVEN_HEADDIM:
286
+ tl.atomic_add(dq_ptrs, dq)
287
+ elif EVEN_HEADDIM:
288
+ tl.atomic_add(dq_ptrs, dq, mask=offs_m_curr[:, None] < seqlen_q)
289
+ else:
290
+ tl.atomic_add(dq_ptrs, dq, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim))
291
+ dq_ptrs += BLOCK_M * stride_dqm
292
+ q_ptrs += BLOCK_M * stride_qm
293
+ do_ptrs += BLOCK_M * stride_dom
294
+ if BIAS_TYPE == 'matrix':
295
+ b_ptrs += BLOCK_M * stride_bm
296
+ dv_ptrs = DV + (offs_n[:, None] * stride_dvn + offs_d[None, :])
297
+ dk_ptrs = DK + (offs_n[:, None] * stride_dkn + offs_d[None, :])
298
+ _bwd_store_dk_dv(dk_ptrs, dv_ptrs, dk, dv, offs_n, offs_d, seqlen_k, headdim, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM)
299
+
300
+ def init_to_zero(name):
301
+ return lambda nargs: nargs[name].zero_()
302
+
303
+ @triton.autotune(configs=[triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'SEQUENCE_PARALLEL': False}, num_warps=8, num_stages=1, pre_hook=init_to_zero('DQ')), triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'SEQUENCE_PARALLEL': True}, num_warps=8, num_stages=1, pre_hook=init_to_zero('DQ'))], key=['CACHE_KEY_SEQLEN_Q', 'CACHE_KEY_SEQLEN_K', 'BIAS_TYPE', 'IS_CAUSAL', 'BLOCK_HEADDIM'])
304
+ @triton.heuristics({'EVEN_M': lambda args: args['seqlen_q'] % args['BLOCK_M'] == 0, 'EVEN_N': lambda args: args['seqlen_k'] % args['BLOCK_N'] == 0, 'EVEN_HEADDIM': lambda args: args['headdim'] == args['BLOCK_HEADDIM']})
305
+ @triton.jit
306
+ def _bwd_kernel(Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_dob, stride_doh, stride_dom, stride_dqb, stride_dqh, stride_dqm, stride_dkb, stride_dkh, stride_dkn, stride_dvb, stride_dvh, stride_dvn, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE: tl.constexpr, IS_CAUSAL: tl.constexpr, BLOCK_HEADDIM: tl.constexpr, SEQUENCE_PARALLEL: tl.constexpr, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
307
+ off_hb = tl.program_id(1)
308
+ off_b = off_hb // nheads
309
+ off_h = off_hb % nheads
310
+ Q += off_b * stride_qb + off_h * stride_qh
311
+ K += off_b * stride_kb + off_h * stride_kh
312
+ V += off_b * stride_vb + off_h * stride_vh
313
+ DO += off_b * stride_dob + off_h * stride_doh
314
+ DQ += off_b * stride_dqb + off_h * stride_dqh
315
+ DK += off_b * stride_dkb + off_h * stride_dkh
316
+ DV += off_b * stride_dvb + off_h * stride_dvh
317
+ if BIAS_TYPE != 'none':
318
+ Bias += off_b * stride_bb + off_h * stride_bh
319
+ D += off_hb * seqlen_q_rounded
320
+ LSE += off_hb * seqlen_q_rounded
321
+ if not SEQUENCE_PARALLEL:
322
+ num_block_n = tl.cdiv(seqlen_k, BLOCK_N)
323
+ for start_n in range(0, num_block_n):
324
+ _bwd_kernel_one_col_block(start_n, Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qm, stride_kn, stride_vn, stride_bm, stride_dom, stride_dqm, stride_dkn, stride_dvn, seqlen_q, seqlen_k, headdim, ATOMIC_ADD=False, BIAS_TYPE=BIAS_TYPE, IS_CAUSAL=IS_CAUSAL, BLOCK_HEADDIM=BLOCK_HEADDIM, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM, BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)
325
+ else:
326
+ start_n = tl.program_id(0)
327
+ _bwd_kernel_one_col_block(start_n, Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qm, stride_kn, stride_vn, stride_bm, stride_dom, stride_dqm, stride_dkn, stride_dvn, seqlen_q, seqlen_k, headdim, ATOMIC_ADD=True, BIAS_TYPE=BIAS_TYPE, IS_CAUSAL=IS_CAUSAL, BLOCK_HEADDIM=BLOCK_HEADDIM, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM, BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)
328
+
329
+ def _flash_attn_forward(q, k, v, bias=None, causal=False, softmax_scale=None):
330
+ (batch, seqlen_q, nheads, d) = q.shape
331
+ (_, seqlen_k, _, _) = k.shape
332
+ assert k.shape == (batch, seqlen_k, nheads, d)
333
+ assert v.shape == (batch, seqlen_k, nheads, d)
334
+ assert d <= 128, 'FlashAttention only support head dimensions up to 128'
335
+ assert q.dtype == k.dtype == v.dtype, 'All tensors must have the same type'
336
+ assert q.dtype in [torch.float16, torch.bfloat16], 'Only support fp16 and bf16'
337
+ assert q.is_cuda and k.is_cuda and v.is_cuda
338
+ softmax_scale = softmax_scale or 1.0 / math.sqrt(d)
339
+ has_bias = bias is not None
340
+ bias_type = 'none'
341
+ if has_bias:
342
+ assert bias.dtype in [q.dtype, torch.float]
343
+ assert bias.is_cuda
344
+ assert bias.dim() == 4
345
+ if bias.stride(-1) != 1:
346
+ bias = bias.contiguous()
347
+ if bias.shape[2:] == (1, seqlen_k):
348
+ bias_type = 'vector'
349
+ elif bias.shape[2:] == (seqlen_q, seqlen_k):
350
+ bias_type = 'matrix'
351
+ else:
352
+ raise RuntimeError('Last 2 dimensions of bias must be (1, seqlen_k) or (seqlen_q, seqlen_k)')
353
+ bias = bias.expand(batch, nheads, seqlen_q, seqlen_k)
354
+ bias_strides = (bias.stride(0), bias.stride(1), bias.stride(2)) if has_bias else (0, 0, 0)
355
+ seqlen_q_rounded = math.ceil(seqlen_q / 128) * 128
356
+ lse = torch.empty((batch, nheads, seqlen_q_rounded), device=q.device, dtype=torch.float32)
357
+ tmp = torch.empty((batch, nheads, seqlen_q_rounded), device=q.device, dtype=torch.float32)
358
+ o = torch.empty_like(q)
359
+ BLOCK_HEADDIM = max(triton.next_power_of_2(d), 16)
360
+ BLOCK = 128
361
+ num_warps = 4 if d <= 64 else 8
362
+ grid = lambda META: (triton.cdiv(seqlen_q, META['BLOCK_M']), batch * nheads)
363
+ _fwd_kernel[grid](q, k, v, bias, o, lse, tmp, softmax_scale, q.stride(0), q.stride(2), q.stride(1), k.stride(0), k.stride(2), k.stride(1), v.stride(0), v.stride(2), v.stride(1), *bias_strides, o.stride(0), o.stride(2), o.stride(1), nheads, seqlen_q, seqlen_k, seqlen_q_rounded, d, seqlen_q // 32, seqlen_k // 32, bias_type, causal, BLOCK_HEADDIM, BLOCK_M=BLOCK, BLOCK_N=BLOCK, num_warps=num_warps, num_stages=1)
364
+ return (o, lse, softmax_scale)
365
+
366
+ def _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv, bias=None, causal=False, softmax_scale=None):
367
+ if do.stride(-1) != 1:
368
+ do = do.contiguous()
369
+ (batch, seqlen_q, nheads, d) = q.shape
370
+ (_, seqlen_k, _, _) = k.shape
371
+ assert d <= 128
372
+ seqlen_q_rounded = math.ceil(seqlen_q / 128) * 128
373
+ assert lse.shape == (batch, nheads, seqlen_q_rounded)
374
+ assert q.stride(-1) == k.stride(-1) == v.stride(-1) == o.stride(-1) == 1
375
+ assert dq.stride(-1) == dk.stride(-1) == dv.stride(-1) == 1
376
+ softmax_scale = softmax_scale or 1.0 / math.sqrt(d)
377
+ dq_accum = torch.empty_like(q, dtype=torch.float32)
378
+ delta = torch.empty_like(lse)
379
+ BLOCK_HEADDIM = max(triton.next_power_of_2(d), 16)
380
+ grid = lambda META: (triton.cdiv(seqlen_q, META['BLOCK_M']), batch * nheads)
381
+ _bwd_preprocess_do_o_dot[grid](o, do, delta, o.stride(0), o.stride(2), o.stride(1), do.stride(0), do.stride(2), do.stride(1), nheads, seqlen_q, seqlen_q_rounded, d, BLOCK_M=128, BLOCK_HEADDIM=BLOCK_HEADDIM)
382
+ has_bias = bias is not None
383
+ bias_type = 'none'
384
+ if has_bias:
385
+ assert bias.dtype in [q.dtype, torch.float]
386
+ assert bias.is_cuda
387
+ assert bias.dim() == 4
388
+ assert bias.stride(-1) == 1
389
+ if bias.shape[2:] == (1, seqlen_k):
390
+ bias_type = 'vector'
391
+ elif bias.shape[2:] == (seqlen_q, seqlen_k):
392
+ bias_type = 'matrix'
393
+ else:
394
+ raise RuntimeError('Last 2 dimensions of bias must be (1, seqlen_k) or (seqlen_q, seqlen_k)')
395
+ bias = bias.expand(batch, nheads, seqlen_q, seqlen_k)
396
+ bias_strides = (bias.stride(0), bias.stride(1), bias.stride(2)) if has_bias else (0, 0, 0)
397
+ grid = lambda META: (triton.cdiv(seqlen_k, META['BLOCK_N']) if META['SEQUENCE_PARALLEL'] else 1, batch * nheads)
398
+ _bwd_kernel[grid](q, k, v, bias, do, dq_accum, dk, dv, lse, delta, softmax_scale, q.stride(0), q.stride(2), q.stride(1), k.stride(0), k.stride(2), k.stride(1), v.stride(0), v.stride(2), v.stride(1), *bias_strides, do.stride(0), do.stride(2), do.stride(1), dq_accum.stride(0), dq_accum.stride(2), dq_accum.stride(1), dk.stride(0), dk.stride(2), dk.stride(1), dv.stride(0), dv.stride(2), dv.stride(1), nheads, seqlen_q, seqlen_k, seqlen_q_rounded, d, seqlen_q // 32, seqlen_k // 32, bias_type, causal, BLOCK_HEADDIM)
399
+ dq.copy_(dq_accum)
400
+
401
+ class FlashAttnQKVPackedFunc(torch.autograd.Function):
402
+
403
+ @staticmethod
404
+ def forward(ctx, qkv, bias=None, causal=False, softmax_scale=None):
405
+ """
406
+ qkv: (batch, seqlen, 3, nheads, headdim)
407
+ bias: optional, shape broadcastible to (batch, nheads, seqlen, seqlen).
408
+ For example, ALiBi mask for causal would have shape (1, nheads, 1, seqlen).
409
+ ALiBi mask for non-causal would have shape (1, nheads, seqlen, seqlen)
410
+ """
411
+ if qkv.stride(-1) != 1:
412
+ qkv = qkv.contiguous()
413
+ (o, lse, ctx.softmax_scale) = _flash_attn_forward(qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2], bias=bias, causal=causal, softmax_scale=softmax_scale)
414
+ ctx.save_for_backward(qkv, o, lse, bias)
415
+ ctx.causal = causal
416
+ return o
417
+
418
+ @staticmethod
419
+ def backward(ctx, do):
420
+ (qkv, o, lse, bias) = ctx.saved_tensors
421
+ assert not ctx.needs_input_grad[1], 'FlashAttention does not support bias gradient yet'
422
+ with torch.inference_mode():
423
+ dqkv = torch.empty_like(qkv)
424
+ _flash_attn_backward(do, qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2], o, lse, dqkv[:, :, 0], dqkv[:, :, 1], dqkv[:, :, 2], bias=bias, causal=ctx.causal, softmax_scale=ctx.softmax_scale)
425
+ return (dqkv, None, None, None)
426
+ flash_attn_qkvpacked_func = FlashAttnQKVPackedFunc.apply
427
+
428
+ class FlashAttnKVPackedFunc(torch.autograd.Function):
429
+
430
+ @staticmethod
431
+ def forward(ctx, q, kv, bias=None, causal=False, softmax_scale=None):
432
+ """
433
+ q: (batch, seqlen_q, nheads, headdim)
434
+ kv: (batch, seqlen_k, 2, nheads, headdim)
435
+ bias: optional, shape broadcastible to (batch, nheads, seqlen_q, seqlen_k).
436
+ For example, ALiBi mask for causal would have shape (1, nheads, 1, seqlen_k).
437
+ ALiBi mask for non-causal would have shape (1, nheads, seqlen_q, seqlen_k)
438
+ """
439
+ (q, kv) = [x if x.stride(-1) == 1 else x.contiguous() for x in [q, kv]]
440
+ (o, lse, ctx.softmax_scale) = _flash_attn_forward(q, kv[:, :, 0], kv[:, :, 1], bias=bias, causal=causal, softmax_scale=softmax_scale)
441
+ ctx.save_for_backward(q, kv, o, lse, bias)
442
+ ctx.causal = causal
443
+ return o
444
+
445
+ @staticmethod
446
+ def backward(ctx, do):
447
+ (q, kv, o, lse, bias) = ctx.saved_tensors
448
+ if len(ctx.needs_input_grad) >= 3:
449
+ assert not ctx.needs_input_grad[2], 'FlashAttention does not support bias gradient yet'
450
+ with torch.inference_mode():
451
+ dq = torch.empty_like(q)
452
+ dkv = torch.empty_like(kv)
453
+ _flash_attn_backward(do, q, kv[:, :, 0], kv[:, :, 1], o, lse, dq, dkv[:, :, 0], dkv[:, :, 1], bias=bias, causal=ctx.causal, softmax_scale=ctx.softmax_scale)
454
+ return (dq, dkv, None, None, None)
455
+ flash_attn_kvpacked_func = FlashAttnKVPackedFunc.apply
456
+
457
+ class FlashAttnFunc(torch.autograd.Function):
458
+
459
+ @staticmethod
460
+ def forward(ctx, q, k, v, bias=None, causal=False, softmax_scale=None):
461
+ """
462
+ q: (batch_size, seqlen_q, nheads, headdim)
463
+ k, v: (batch_size, seqlen_k, nheads, headdim)
464
+ bias: optional, shape broadcastible to (batch, nheads, seqlen_q, seqlen_k).
465
+ For example, ALiBi mask for causal would have shape (1, nheads, 1, seqlen_k).
466
+ ALiBi mask for non-causal would have shape (1, nheads, seqlen_q, seqlen_k)
467
+ """
468
+ (q, k, v) = [x if x.stride(-1) == 1 else x.contiguous() for x in [q, k, v]]
469
+ (o, lse, ctx.softmax_scale) = _flash_attn_forward(q, k, v, bias=bias, causal=causal, softmax_scale=softmax_scale)
470
+ ctx.save_for_backward(q, k, v, o, lse, bias)
471
+ ctx.causal = causal
472
+ return o
473
+
474
+ @staticmethod
475
+ def backward(ctx, do):
476
+ (q, k, v, o, lse, bias) = ctx.saved_tensors
477
+ assert not ctx.needs_input_grad[3], 'FlashAttention does not support bias gradient yet'
478
+ with torch.inference_mode():
479
+ dq = torch.empty_like(q)
480
+ dk = torch.empty_like(k)
481
+ dv = torch.empty_like(v)
482
+ _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv, bias=bias, causal=ctx.causal, softmax_scale=ctx.softmax_scale)
483
+ return (dq, dk, dv, None, None, None)
484
+ flash_attn_func = FlashAttnFunc.apply
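The shape checks in `_flash_attn_forward` above can be restated without Triton or a GPU. What follows is a minimal sketch in plain Python, not the kernel wrapper itself; `classify_bias` and `round_up_seqlen` are hypothetical helper names introduced here for illustration only. It mirrors how the last two bias dimensions select the `'vector'` or `'matrix'` path, and how `seqlen_q` is padded up to a multiple of 128 when sizing the `lse` buffer.

```python
import math


def classify_bias(bias_shape, seqlen_q, seqlen_k):
    """Return 'vector' or 'matrix' from the last two dims of a 4-D bias shape.

    Hypothetical helper: it only mirrors the shape checks in the wrapper above.
    """
    if len(bias_shape) != 4:
        raise ValueError('bias must be 4-dimensional')
    last_two = tuple(bias_shape[2:])
    # The (1, seqlen_k) case is tested first, in the same order as the wrapper.
    if last_two == (1, seqlen_k):
        return 'vector'
    if last_two == (seqlen_q, seqlen_k):
        return 'matrix'
    raise RuntimeError('Last 2 dimensions of bias must be (1, seqlen_k) or (seqlen_q, seqlen_k)')


def round_up_seqlen(seqlen_q, block=128):
    """Pad seqlen_q up to the next multiple of `block` (128 in the wrapper)."""
    return math.ceil(seqlen_q / block) * block
```

Under these assumptions, a causal ALiBi bias of shape `(1, nheads, 1, seqlen_k)` takes the `'vector'` path, while a non-causal bias of shape `(1, nheads, seqlen_q, seqlen_k)` takes the `'matrix'` path, matching the docstrings of the three autograd functions.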
generation_config.json CHANGED
@@ -1,5 +1,6 @@
1
  {
2
  "_from_model_config": true,
3
  "transformers_version": "4.28.1",
 
4
  "use_cache": false
5
  }
 
1
  {
2
  "_from_model_config": true,
3
  "transformers_version": "4.28.1",
4
+ "eos_token_id": 0,
5
  "use_cache": false
6
  }
hf_prefixlm_converter.py CHANGED
@@ -6,23 +6,13 @@ Causal LM to convert it to a Prefix LM.
6
  Prefix LMs accepts a `bidirectional_mask` input in `forward`
7
  and treat the input prompt as the prefix in `generate`.
8
  """
9
- import math
10
- import warnings
11
  from types import MethodType
12
- from typing import Any, Dict, List, Optional, Tuple, Union
13
  import torch
14
- from transformers.models.bloom.modeling_bloom import BaseModelOutputWithPastAndCrossAttentions, BloomForCausalLM, BloomModel, CausalLMOutputWithCrossAttentions, CrossEntropyLoss
15
- from transformers.models.bloom.modeling_bloom import _expand_mask as _expand_mask_bloom
16
- from transformers.models.bloom.modeling_bloom import _make_causal_mask as _make_causal_mask_bloom
17
- from transformers.models.bloom.modeling_bloom import logging
18
  from transformers.models.gpt2.modeling_gpt2 import GPT2LMHeadModel
19
  from transformers.models.gpt_neo.modeling_gpt_neo import GPTNeoForCausalLM
20
  from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXForCausalLM
21
  from transformers.models.gptj.modeling_gptj import GPTJForCausalLM
22
- from transformers.models.opt.modeling_opt import OPTForCausalLM
23
- from transformers.models.opt.modeling_opt import _expand_mask as _expand_mask_opt
24
- from transformers.models.opt.modeling_opt import _make_causal_mask as _make_causal_mask_opt
25
- logger = logging.get_logger(__name__)
26
  _SUPPORTED_GPT_MODELS = (GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM)
27
  CAUSAL_GPT_TYPES = Union[GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM]
28
 
@@ -90,13 +80,14 @@ def _convert_gpt_causal_lm_to_prefix_lm(model: CAUSAL_GPT_TYPES) -> CAUSAL_GPT_T
90
  bidirectional_mask = torch.cat([bidirectional_mask, pad], dim=1)
91
  bidirectional = bidirectional_mask.unsqueeze(1).unsqueeze(1)
92
  for attn_module in attn_modules:
 
93
  attn_module.bias.data = torch.logical_or(attn_module.bias.data, bidirectional)
94
  output = call_og_forward()
95
  for attn_module in attn_modules:
96
  attn_module.bias.data = torch.tril(attn_module.bias.data[0, 0])[None, None]
97
  return output
98
 
99
- def generate(self: CAUSAL_GPT_TYPES, *args: tuple, **kwargs: Dict[str, Any]):
100
  """Wraps original generate to enable PrefixLM attention."""
101
  attn_modules = _get_attn_modules(model)
102
  for attn_module in attn_modules:
@@ -109,228 +100,8 @@ def _convert_gpt_causal_lm_to_prefix_lm(model: CAUSAL_GPT_TYPES) -> CAUSAL_GPT_T
109
  setattr(model, 'generate', MethodType(generate, model))
110
  setattr(model, '_prefix_lm_converted', True)
111
  return model
112
-
113
- def _convert_bloom_causal_lm_to_prefix_lm(model: BloomForCausalLM) -> BloomForCausalLM:
114
- """Converts a BLOOM Causal LM to a Prefix LM.
115
-
116
- Supported HuggingFace model classes:
117
- - `BloomForCausalLM`
118
-
119
- See `convert_hf_causal_lm_to_prefix_lm` for more details.
120
- """
121
- if hasattr(model, '_prefix_lm_converted'):
122
- return model
123
- assert isinstance(model, BloomForCausalLM)
124
- assert model.config.add_cross_attention == False, 'Only supports BLOOM decoder-only models'
125
-
126
- def _prepare_attn_mask(self: BloomModel, attention_mask: torch.Tensor, bidirectional_mask: Optional[torch.Tensor], input_shape: Tuple[int, int], past_key_values_length: int) -> torch.BoolTensor:
127
- combined_attention_mask = None
128
- device = attention_mask.device
129
- (_, src_length) = input_shape
130
- if src_length > 1:
131
- combined_attention_mask = _make_causal_mask_bloom(input_shape, device=device, past_key_values_length=past_key_values_length)
132
- if bidirectional_mask is not None:
133
- assert attention_mask.shape == bidirectional_mask.shape
134
- expanded_bidirectional_mask = _expand_mask_bloom(bidirectional_mask, tgt_length=src_length)
135
- combined_attention_mask = torch.logical_and(combined_attention_mask, expanded_bidirectional_mask)
136
- expanded_attn_mask = _expand_mask_bloom(attention_mask, tgt_length=src_length)
137
- combined_attention_mask = expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask | combined_attention_mask
138
- return combined_attention_mask
139
-
140
- def _build_alibi_tensor(self: BloomModel, batch_size: int, query_length: int, key_length: int, dtype: torch.dtype, device: torch.device) -> torch.Tensor:
141
- num_heads = self.config.n_head
142
- closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
143
- base = torch.tensor(2 ** (-2 ** (-(math.log2(closest_power_of_2) - 3))), device=device, dtype=torch.float32)
144
- powers = torch.arange(1, 1 + closest_power_of_2, device=device, dtype=torch.int32)
145
- slopes = torch.pow(base, powers)
146
- if closest_power_of_2 != num_heads:
147
- extra_base = torch.tensor(2 ** (-2 ** (-(math.log2(2 * closest_power_of_2) - 3))), device=device, dtype=torch.float32)
148
- num_remaining_heads = min(closest_power_of_2, num_heads - closest_power_of_2)
149
- extra_powers = torch.arange(1, 1 + 2 * num_remaining_heads, 2, device=device, dtype=torch.int32)
150
- slopes = torch.cat([slopes, torch.pow(extra_base, extra_powers)], dim=0)
151
- qa = torch.arange(query_length, device=device, dtype=torch.int32).view(-1, 1)
152
- ka = torch.arange(key_length, device=device, dtype=torch.int32).view(1, -1)
153
- diffs = qa - ka + key_length - query_length
154
- diffs = -diffs.abs()
155
- alibi = slopes.view(1, num_heads, 1, 1) * diffs.view(1, 1, query_length, key_length)
156
- alibi = alibi.expand(batch_size, -1, -1, -1).reshape(-1, query_length, key_length)
157
- return alibi.to(dtype)
158
- KeyValueT = Tuple[torch.Tensor, torch.Tensor]
159
-
160
- def forward(self: BloomModel, input_ids: Optional[torch.LongTensor]=None, past_key_values: Optional[Tuple[KeyValueT, ...]]=None, attention_mask: Optional[torch.Tensor]=None, bidirectional_mask: Optional[torch.Tensor]=None, head_mask: Optional[torch.LongTensor]=None, inputs_embeds: Optional[torch.LongTensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None, **deprecated_arguments) -> Union[Tuple[torch.Tensor, ...], BaseModelOutputWithPastAndCrossAttentions]:
161
- if deprecated_arguments.pop('position_ids', False) is not False:
162
- warnings.warn('`position_ids` have no functionality in BLOOM and will be removed in v5.0.0. ' + 'You can safely ignore passing `position_ids`.', FutureWarning)
163
- if len(deprecated_arguments) > 0:
164
- raise ValueError(f'Got unexpected arguments: {deprecated_arguments}')
165
- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
166
- output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
167
- use_cache = use_cache if use_cache is not None else self.config.use_cache
168
- return_dict = return_dict if return_dict is not None else self.config.use_return_dict
169
- if input_ids is not None and inputs_embeds is not None:
170
- raise ValueError('You cannot specify both input_ids and inputs_embeds at the same time')
171
- elif input_ids is not None:
172
- (batch_size, seq_length) = input_ids.shape
173
- elif inputs_embeds is not None:
174
- (batch_size, seq_length, _) = inputs_embeds.shape
175
- else:
176
- raise ValueError('You have to specify either input_ids or inputs_embeds')
177
- if past_key_values is None:
178
- past_key_values = tuple([None] * len(self.h))
179
- head_mask = self.get_head_mask(head_mask, self.config.n_layer)
180
- if inputs_embeds is None:
181
- inputs_embeds = self.word_embeddings(input_ids)
182
- hidden_states = self.word_embeddings_layernorm(inputs_embeds)
183
- presents = () if use_cache else None
184
- all_self_attentions = () if output_attentions else None
185
- all_hidden_states = () if output_hidden_states else None
186
- seq_length_with_past = seq_length
187
- past_key_values_length = 0
188
- if past_key_values[0] is not None:
189
- tmp = past_key_values[0][0]
190
- past_key_values_length = tmp.shape[2]
191
- seq_length_with_past = seq_length_with_past + past_key_values_length
192
- if attention_mask is None:
193
- attention_mask = torch.ones((batch_size, seq_length_with_past), device=hidden_states.device)
194
- else:
195
- attention_mask = attention_mask.to(hidden_states.device)
196
- alibi = self._build_alibi_tensor(batch_size=batch_size, query_length=seq_length, key_length=seq_length_with_past, dtype=hidden_states.dtype, device=hidden_states.device)
197
- causal_mask = self._prepare_attn_mask(attention_mask, bidirectional_mask, input_shape=(batch_size, seq_length), past_key_values_length=past_key_values_length)
198
- for (i, (block, layer_past)) in enumerate(zip(self.h, past_key_values)):
199
- if output_hidden_states:
200
- hst = (hidden_states,)
201
- all_hidden_states = all_hidden_states + hst
202
- if self.gradient_checkpointing and self.training:
203
- if use_cache:
204
- logger.warning('`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...')
205
- use_cache = False
206
-
207
- def create_custom_forward(module):
208
-
209
- def custom_forward(*inputs):
210
- return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)
211
- return custom_forward
212
- outputs = torch.utils.checkpoint.checkpoint(create_custom_forward(block), hidden_states, alibi, causal_mask, head_mask[i])
213
- else:
214
- outputs = block(hidden_states, layer_past=layer_past, attention_mask=causal_mask, head_mask=head_mask[i], use_cache=use_cache, output_attentions=output_attentions, alibi=alibi)
215
- hidden_states = outputs[0]
216
- if use_cache is True:
217
- presents = presents + (outputs[1],)
218
- if output_attentions:
219
- oa = (outputs[2 if use_cache else 1],)
220
- all_self_attentions = all_self_attentions + oa
221
- hidden_states = self.ln_f(hidden_states)
222
- if output_hidden_states:
223
- hst = (hidden_states,)
224
- all_hidden_states = all_hidden_states + hst
225
- if not return_dict:
226
- return tuple((v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None))
227
- return BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=hidden_states, past_key_values=presents, hidden_states=all_hidden_states, attentions=all_self_attentions)
228
- setattr(model.transformer, '_prepare_attn_mask', MethodType(_prepare_attn_mask, model.transformer))
229
- setattr(model.transformer, '_build_alibi_tensor', MethodType(_build_alibi_tensor, model.transformer))
230
- setattr(model.transformer, 'forward', MethodType(forward, model.transformer))
231
- KeyValueT = Tuple[torch.Tensor, torch.Tensor]
232
-
233
- def forward(self: BloomForCausalLM, input_ids: Optional[torch.LongTensor]=None, past_key_values: Optional[Tuple[KeyValueT, ...]]=None, attention_mask: Optional[torch.Tensor]=None, bidirectional_mask: Optional[torch.Tensor]=None, head_mask: Optional[torch.Tensor]=None, inputs_embeds: Optional[torch.Tensor]=None, labels: Optional[torch.Tensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None, **deprecated_arguments) -> Union[Tuple[torch.Tensor], CausalLMOutputWithCrossAttentions]:
234
- """Replacement forward method for BloomCausalLM."""
235
- if deprecated_arguments.pop('position_ids', False) is not False:
236
- warnings.warn('`position_ids` have no functionality in BLOOM and will be removed ' + 'in v5.0.0. You can safely ignore passing `position_ids`.', FutureWarning)
237
- if len(deprecated_arguments) > 0:
238
- raise ValueError(f'Got unexpected arguments: {deprecated_arguments}')
239
- return_dict = return_dict if return_dict is not None else self.config.use_return_dict
240
- transformer_outputs = self.transformer(input_ids, past_key_values=past_key_values, attention_mask=attention_mask, bidirectional_mask=bidirectional_mask, head_mask=head_mask, inputs_embeds=inputs_embeds, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
241
- hidden_states = transformer_outputs[0]
242
- lm_logits = self.lm_head(hidden_states)
243
- loss = None
244
- if labels is not None:
245
- shift_logits = lm_logits[..., :-1, :].contiguous()
246
- shift_labels = labels[..., 1:].contiguous()
247
- (batch_size, seq_length, vocab_size) = shift_logits.shape
248
- loss_fct = CrossEntropyLoss()
249
- loss = loss_fct(shift_logits.view(batch_size * seq_length, vocab_size), shift_labels.view(batch_size * seq_length))
250
- if not return_dict:
251
- output = (lm_logits,) + transformer_outputs[1:]
252
- return (loss,) + output if loss is not None else output
253
- return CausalLMOutputWithCrossAttentions(loss=loss, logits=lm_logits, past_key_values=transformer_outputs.past_key_values, hidden_states=transformer_outputs.hidden_states, attentions=transformer_outputs.attentions)
254
-
255
- def prepare_inputs_for_generation(self: BloomForCausalLM, input_ids: torch.LongTensor, past: Optional[torch.Tensor]=None, attention_mask: Optional[torch.Tensor]=None, **kwargs) -> dict:
256
- if past:
257
- input_ids = input_ids[:, -1].unsqueeze(-1)
258
- bidirectional_mask = None
259
- if past[0][0].shape[0] == input_ids.shape[0]:
260
- past = self._convert_to_bloom_cache(past)
261
- else:
262
- bidirectional_mask = torch.ones_like(input_ids)
263
- return {'input_ids': input_ids, 'past_key_values': past, 'use_cache': True, 'attention_mask': attention_mask, 'bidirectional_mask': bidirectional_mask}
264
- setattr(model, 'forward', MethodType(forward, model))
265
- setattr(model, 'prepare_inputs_for_generation', MethodType(prepare_inputs_for_generation, model))
266
- setattr(model, '_prefix_lm_converted', True)
267
- return model
268
-
269
- def _convert_opt_causal_lm_to_prefix_lm(model: OPTForCausalLM) -> OPTForCausalLM:
270
- """Converts an OPT Causal LM to a Prefix LM.
271
-
272
- Supported HuggingFace model classes:
273
- - `OPTForCausalLM`
274
-
275
- See `convert_hf_causal_lm_to_prefix_lm` for more details.
276
- """
277
- if hasattr(model, '_prefix_lm_converted'):
278
- return model
279
- assert isinstance(model, OPTForCausalLM)
280
- assert model.config.add_cross_attention == False, 'Only supports OPT decoder-only models'
281
- setattr(model, '_original_forward', getattr(model, 'forward'))
282
- setattr(model, '_original_generate', getattr(model, 'generate'))
283
- model.model.decoder.bidirectional_mask = None
284
-
285
- def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
286
- combined_attention_mask = None
287
- if input_shape[-1] > 1:
288
- if self.bidirectional_mask == 'g':
289
- (bsz, src_length) = input_shape
290
- combined_attention_mask = torch.zeros((bsz, 1, src_length, src_length + past_key_values_length), dtype=inputs_embeds.dtype, device=inputs_embeds.device)
291
- else:
292
- combined_attention_mask = _make_causal_mask_opt(input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length).to(inputs_embeds.device)
293
- if self.bidirectional_mask is not None:
294
- assert attention_mask.shape == self.bidirectional_mask.shape
295
- expanded_bidirectional_mask = _expand_mask_opt(self.bidirectional_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(inputs_embeds.device)
296
- combined_attention_mask = torch.maximum(expanded_bidirectional_mask, combined_attention_mask)
297
- if attention_mask is not None:
298
- expanded_attn_mask = _expand_mask_opt(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(inputs_embeds.device)
299
- combined_attention_mask = expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
300
- return combined_attention_mask
301
- setattr(model.model.decoder, '_prepare_decoder_attention_mask', MethodType(_prepare_decoder_attention_mask, model.model.decoder))
302
-
303
- def forward(self: OPTForCausalLM, input_ids: Optional[torch.LongTensor]=None, attention_mask: Optional[torch.Tensor]=None, bidirectional_mask: Optional[torch.ByteTensor]=None, head_mask: Optional[torch.Tensor]=None, past_key_values: Optional[List[torch.FloatTensor]]=None, inputs_embeds: Optional[torch.FloatTensor]=None, labels: Optional[torch.LongTensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None):
304
-
305
- def call_og_forward():
306
- return self._original_forward(input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, past_key_values=past_key_values, inputs_embeds=inputs_embeds, labels=labels, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
307
- if bidirectional_mask is None:
308
- return call_og_forward()
309
- self.model.decoder.bidirectional_mask = bidirectional_mask
310
- try:
311
- outputs = call_og_forward()
312
- except:
313
- self.model.decoder.bidirectional_mask = None
314
- raise
315
- self.model.decoder.bidirectional_mask = None
316
- return outputs
317
-
318
- def generate(self: OPTForCausalLM, *args: tuple, **kwargs: Dict[str, Any]):
319
- """Wraps original generate to enable PrefixLM-style attention."""
320
- self.model.decoder.bidirectional_mask = 'g'
321
- try:
322
- output = self._original_generate(*args, **kwargs)
323
- except:
324
- self.model.decoder.bidirectional_mask = None
325
- raise
326
- self.model.decoder.bidirectional_mask = None
327
- return output
328
- setattr(model, 'forward', MethodType(forward, model))
329
- setattr(model, 'generate', MethodType(generate, model))
330
- setattr(model, '_prefix_lm_converted', True)
331
- return model
332
- _SUPPORTED_HF_MODELS = _SUPPORTED_GPT_MODELS + (BloomForCausalLM, OPTForCausalLM)
333
- CAUSAL_LM_TYPES = Union[GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM, BloomForCausalLM, OPTForCausalLM]
334
 
335
  def convert_hf_causal_lm_to_prefix_lm(model: CAUSAL_LM_TYPES) -> CAUSAL_LM_TYPES:
336
  """Converts a HuggingFace Causal LM to a Prefix LM.
@@ -340,8 +111,6 @@ def convert_hf_causal_lm_to_prefix_lm(model: CAUSAL_LM_TYPES) -> CAUSAL_LM_TYPES
340
  - `GPTNeoForCausalLM`
341
  - `GPTNeoXForCausalLM`
342
  - `GPTJForCausalLM`
343
- - `BloomForCausalLM`
344
- - `OPTForCausalLM`
345
 
346
  Conversion to a Prefix LM is done by modifying the `forward` method, and possibly also the
347
  `generate` method and/or select underlying methods depending on the model class.
@@ -391,14 +160,10 @@ def convert_hf_causal_lm_to_prefix_lm(model: CAUSAL_LM_TYPES) -> CAUSAL_LM_TYPES
391
  """
392
  if isinstance(model, _SUPPORTED_GPT_MODELS):
393
  return _convert_gpt_causal_lm_to_prefix_lm(model)
394
- elif isinstance(model, BloomForCausalLM):
395
- return _convert_bloom_causal_lm_to_prefix_lm(model)
396
- elif isinstance(model, OPTForCausalLM):
397
- return _convert_opt_causal_lm_to_prefix_lm(model)
398
  else:
399
  raise TypeError(f'Cannot convert model to Prefix LM. ' + f'Model does not belong to set of supported HF models:' + f'\n{_SUPPORTED_HF_MODELS}')
400
 
401
- def add_bidirectional_mask_if_missing(batch: Dict[str, Any]):
402
  """Attempts to add bidirectional_mask to batch if missing.
403
 
404
  Raises:
 
6
  Prefix LMs accepts a `bidirectional_mask` input in `forward`
7
  and treat the input prompt as the prefix in `generate`.
8
  """
 
 
9
  from types import MethodType
10
+ from typing import Any, List, MutableMapping, Optional, Tuple, Union
11
  import torch
 
 
 
 
12
  from transformers.models.gpt2.modeling_gpt2 import GPT2LMHeadModel
13
  from transformers.models.gpt_neo.modeling_gpt_neo import GPTNeoForCausalLM
14
  from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXForCausalLM
15
  from transformers.models.gptj.modeling_gptj import GPTJForCausalLM
 
 
 
 
16
  _SUPPORTED_GPT_MODELS = (GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM)
17
  CAUSAL_GPT_TYPES = Union[GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM]
18
 
 
80
  bidirectional_mask = torch.cat([bidirectional_mask, pad], dim=1)
81
  bidirectional = bidirectional_mask.unsqueeze(1).unsqueeze(1)
82
  for attn_module in attn_modules:
83
+ assert isinstance(attn_module.bias, torch.Tensor)
84
  attn_module.bias.data = torch.logical_or(attn_module.bias.data, bidirectional)
85
  output = call_og_forward()
86
  for attn_module in attn_modules:
87
  attn_module.bias.data = torch.tril(attn_module.bias.data[0, 0])[None, None]
88
  return output
89
 
90
+ def generate(self: CAUSAL_GPT_TYPES, *args: Any, **kwargs: Any):
91
  """Wraps original generate to enable PrefixLM attention."""
92
  attn_modules = _get_attn_modules(model)
93
  for attn_module in attn_modules:
 
100
  setattr(model, 'generate', MethodType(generate, model))
101
  setattr(model, '_prefix_lm_converted', True)
102
  return model
103
+ _SUPPORTED_HF_MODELS = _SUPPORTED_GPT_MODELS
104
+ CAUSAL_LM_TYPES = Union[GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM]
105
 
106
  def convert_hf_causal_lm_to_prefix_lm(model: CAUSAL_LM_TYPES) -> CAUSAL_LM_TYPES:
107
  """Converts a HuggingFace Causal LM to a Prefix LM.
 
111
  - `GPTNeoForCausalLM`
112
  - `GPTNeoXForCausalLM`
113
  - `GPTJForCausalLM`
 
 
114
 
115
  Conversion to a Prefix LM is done by modifying the `forward` method, and possibly also the
116
  `generate` method and/or select underlying methods depending on the model class.
 
160
  """
161
  if isinstance(model, _SUPPORTED_GPT_MODELS):
162
  return _convert_gpt_causal_lm_to_prefix_lm(model)
 
 
 
 
163
  else:
164
  raise TypeError(f'Cannot convert model to Prefix LM. ' + f'Model does not belong to set of supported HF models:' + f'\n{_SUPPORTED_HF_MODELS}')
165
 
166
+ def add_bidirectional_mask_if_missing(batch: MutableMapping):
167
  """Attempts to add bidirectional_mask to batch if missing.
168
 
169
  Raises:
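The GPT conversion above widens the causal mask by OR-ing it with `bidirectional_mask`, then restores the lower-triangular mask after the forward call. A minimal sketch of that mask combination, in plain Python with nested lists instead of tensors, might look like the following. `prefix_lm_mask` is a hypothetical name introduced here and is not part of the converter.

```python
def prefix_lm_mask(seq_len, bidirectional_mask):
    """Combine a causal mask with a per-key prefix mask via logical OR.

    Hypothetical helper. bidirectional_mask[j] == 1 marks key position j
    as part of the prefix; result[i][j] is True when query i may attend
    to key j.
    """
    if len(bidirectional_mask) != seq_len:
        raise ValueError('bidirectional_mask must have one entry per position')
    # Lower-triangular causal mask: query i sees keys 0..i.
    causal = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    # Broadcast the prefix mask over the query dimension and OR it in,
    # so every query also sees every prefix key.
    return [
        [causal[i][j] or bool(bidirectional_mask[j]) for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

With a four-token sequence whose first two tokens form the prefix, the first query row gains visibility of the second token, while positions after the prefix stay strictly causal.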
meta_init_context.py CHANGED
@@ -1,4 +1,5 @@
1
  from contextlib import contextmanager
 
2
  import torch
3
  import torch.nn as nn
4
 
@@ -57,25 +58,29 @@ def init_on_device(device: torch.device, include_buffers: bool=False):
57
  if include_buffers:
58
  old_register_buffer = nn.Module.register_buffer
59
 
60
- def register_empty_parameter(module, name, param):
61
- old_register_parameter(module, name, param)
62
  if param is not None:
63
- param_cls = type(module._parameters[name])
64
- kwargs = module._parameters[name].__dict__
65
- module._parameters[name] = param_cls(module._parameters[name].to(device), **kwargs)
66
-
67
- def register_empty_buffer(module, name, buffer):
68
- old_register_buffer(module, name, buffer)
69
- if buffer is not None:
70
- module._buffers[name] = module._buffers[name].to(device)
 
 
 
 
71
  if include_buffers:
72
  tensor_constructors_to_patch = {torch_function_name: getattr(torch, torch_function_name) for torch_function_name in ['empty', 'zeros', 'ones', 'full']}
73
  else:
74
  tensor_constructors_to_patch = {}
75
 
76
- def patch_tensor_constructor(fn):
77
 
78
- def wrapper(*args, **kwargs):
79
  kwargs['device'] = device
80
  return fn(*args, **kwargs)
81
  return wrapper
 
1
  from contextlib import contextmanager
2
+ from typing import Any, Callable, Optional
3
  import torch
4
  import torch.nn as nn
5
 
 
58
  if include_buffers:
59
  old_register_buffer = nn.Module.register_buffer
60
 
61
+ def register_empty_parameter(self: torch.nn.Module, name: str, param: Optional[torch.nn.Parameter]):
62
+ old_register_parameter(self, name, param)
63
  if param is not None:
64
+ parameter = self._parameters[name]
65
+ assert parameter is not None
66
+ param_cls = type(parameter)
67
+ kwargs = parameter.__dict__
68
+ self._parameters[name] = param_cls(parameter.to(device), **kwargs)
69
+
70
+ def register_empty_buffer(self: torch.nn.Module, name: str, tensor: Optional[torch.Tensor], persistent: bool=True):
71
+ old_register_buffer(self, name, tensor, persistent=persistent)
72
+ if tensor is not None:
73
+ named_buffer = self._buffers[name]
74
+ assert named_buffer is not None
75
+ self._buffers[name] = named_buffer.to(device)
76
  if include_buffers:
77
  tensor_constructors_to_patch = {torch_function_name: getattr(torch, torch_function_name) for torch_function_name in ['empty', 'zeros', 'ones', 'full']}
78
  else:
79
  tensor_constructors_to_patch = {}
80
 
81
+ def patch_tensor_constructor(fn: Callable):
82
 
83
+ def wrapper(*args: Any, **kwargs: Any):
84
  kwargs['device'] = device
85
  return fn(*args, **kwargs)
86
  return wrapper
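The `patch_tensor_constructor` wrapper above forces a `device` keyword onto every patched constructor call. The pattern can be sketched without PyTorch as shown below. `force_device` and `fake_torch` are hypothetical stand-ins introduced for illustration, and the restore step in `finally` is an assumption about how such a patch is normally undone; it is not shown in the hunk above.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def force_device(namespace, names, device):
    """Temporarily wrap constructors so they always receive `device`.

    Hypothetical helper illustrating the patching pattern only.
    """
    originals = {name: getattr(namespace, name) for name in names}

    def patch(fn):
        def wrapper(*args, **kwargs):
            kwargs['device'] = device  # override whatever the caller passed
            return fn(*args, **kwargs)
        return wrapper

    try:
        for name, fn in originals.items():
            setattr(namespace, name, patch(fn))
        yield
    finally:
        # Assumed restore step: put the original constructors back.
        for name, fn in originals.items():
            setattr(namespace, name, fn)


# Stand-in for the `torch` namespace so the sketch runs without PyTorch.
fake_torch = SimpleNamespace(
    empty=lambda *shape, device='cpu': {'shape': shape, 'device': device}
)
```

Inside `with force_device(fake_torch, ['empty'], 'meta'):` every call to `fake_torch.empty` reports the `'meta'` device, even if the caller passes a different one; outside the block the original default applies again.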
modeling_mpt.py CHANGED
@@ -4,55 +4,66 @@ Inspired by https://github.com/karpathy/minGPT/blob/master/mingpt/model.py
  """
  import math
  import warnings
- from typing import List, Optional, Tuple, Union
  import torch
  import torch.nn as nn
  import torch.nn.functional as F
- from transformers import PreTrainedModel, PreTrainedTokenizer, PreTrainedTokenizerFast
  from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
  from .attention import attn_bias_shape, build_attn_bias
  from .blocks import MPTBlock
  from .norm import NORM_CLASS_REGISTRY
  from .configuration_mpt import MPTConfig
  from .adapt_tokenizer import AutoTokenizerForMOD, adapt_tokenizer_for_denoising
  from .hf_prefixlm_converter import add_bidirectional_mask_if_missing, convert_hf_causal_lm_to_prefix_lm
  from .meta_init_context import init_empty_weights
- from .param_init_fns import MODEL_INIT_REGISTRY, generic_param_init_fn_
- Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]

  class MPTPreTrainedModel(PreTrainedModel):
      config_class = MPTConfig
      base_model_prefix = 'model'
-     _no_split_modules = ["MPTBlock"]
-     supports_gradient_checkpointing = True
-
-     def _set_gradient_checkpointing(self, module, value=False):
-         if isinstance(module, MPTModel):
-             module.gradient_checkpointing = value

  class MPTModel(MPTPreTrainedModel):

      def __init__(self, config: MPTConfig):
          config._validate_config()
          super().__init__(config)
-         self.gradient_checkpointing = False
          self.attn_impl = config.attn_config['attn_impl']
          self.prefix_lm = config.attn_config['prefix_lm']
          self.attn_uses_sequence_id = config.attn_config['attn_uses_sequence_id']
          self.alibi = config.attn_config['alibi']
          self.alibi_bias_max = config.attn_config['alibi_bias_max']
          if config.norm_type.lower() not in NORM_CLASS_REGISTRY.keys():
              norm_options = ' | '.join(NORM_CLASS_REGISTRY.keys())
              raise NotImplementedError(f'Requested norm type ({config.norm_type}) is not implemented within this repo (Options: {norm_options}).')
          norm_class = NORM_CLASS_REGISTRY[config.norm_type.lower()]
          self.embedding_fraction = config.embedding_fraction
-         self.wte = nn.Embedding(config.vocab_size, config.d_model, device=config.init_device)
-         if not self.alibi:
-             self.wpe = nn.Embedding(config.max_seq_len, config.d_model, device=config.init_device)
          self.emb_drop = nn.Dropout(config.emb_pdrop)
          self.blocks = nn.ModuleList([MPTBlock(device=config.init_device, **config.to_dict()) for _ in range(config.n_layers)])
          self.norm_f = norm_class(config.d_model, device=config.init_device)
          if config.init_device != 'meta':
              self.apply(self.param_init_fn)
          self.is_causal = not self.prefix_lm
          self._attn_bias_initialized = False
@@ -61,25 +72,22 @@ class MPTModel(MPTPreTrainedModel):
          if config.no_bias:
              for module in self.modules():
                  if hasattr(module, 'bias') and isinstance(module.bias, nn.Parameter):
-                     if config.verbose:
-                         warnings.warn(f'Removing bias ({module.bias}) from {module}.')
                      module.register_parameter('bias', None)
-         if config.verbose and config.verbose > 2:
-             print(self)
-         if 'verbose' not in self.config.init_config:
-             self.config.init_config['verbose'] = self.config.verbose
-         if self.config.init_config['verbose'] > 1:
-             init_fn_name = self.config.init_config['name']
-             warnings.warn(f'Using {init_fn_name} initialization.')

-     def get_input_embeddings(self):
          return self.wte

-     def set_input_embeddings(self, value):
          self.wte = value

      @torch.no_grad()
-     def _attn_bias(self, device, dtype, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None):
          if not self._attn_bias_initialized:
              if self.attn_bias_shape:
                  self.attn_bias = torch.zeros(self.attn_bias_shape, device=device, dtype=dtype)
@@ -102,14 +110,15 @@ class MPTModel(MPTPreTrainedModel):
              if attn_bias is None:
                  attn_bias = torch.zeros((1, 1, 1, s_k), device=device, dtype=dtype)
              else:
-                 attn_bias = attn_bias[:, :, :, -s_k:]
              if prefix_mask is not None and attention_mask.shape != prefix_mask.shape:
                  raise ValueError(f'attention_mask shape={attention_mask.shape} ' + f'and prefix_mask shape={prefix_mask.shape} are not equal.')
              min_val = torch.finfo(attn_bias.dtype).min
              attn_bias = attn_bias.masked_fill(~attention_mask.view(-1, 1, 1, s_k), min_val)
          return (attn_bias, None)

-     def _apply_prefix_mask(self, attn_bias: torch.Tensor, prefix_mask: torch.Tensor):
          (s_k, s_q) = attn_bias.shape[-2:]
          if s_k != self.config.max_seq_len or s_q != self.config.max_seq_len:
              raise ValueError('attn_bias does not match the expected shape. ' + f'The last two dimensions should both be {self.config.max_length} ' + f'but are {s_k} and {s_q}.')
@@ -124,7 +133,7 @@ class MPTModel(MPTPreTrainedModel):
          attn_bias = attn_bias.masked_fill(cannot_attend, min_val)
          return attn_bias

-     def _apply_sequence_id(self, attn_bias: torch.Tensor, sequence_id: torch.LongTensor):
          seq_len = sequence_id.shape[-1]
          if seq_len > self.config.max_seq_len:
              raise ValueError(f'sequence_id sequence length cannot exceed max_seq_len={self.config.max_seq_len}')
@@ -134,122 +143,86 @@ class MPTModel(MPTPreTrainedModel):
          attn_bias = attn_bias.masked_fill(cannot_attend, min_val)
          return attn_bias

-     def forward(self, input_ids: torch.LongTensor, past_key_values: Optional[List[Tuple[torch.FloatTensor]]]=None, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None, return_dict: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, use_cache: Optional[bool]=None, inputs_embeds: Optional[torch.FloatTensor] = None):
          return_dict = return_dict if return_dict is not None else self.config.return_dict
          use_cache = use_cache if use_cache is not None else self.config.use_cache
-         if self.gradient_checkpointing and self.training:
-             if use_cache:
-                 use_cache = False
-         if input_ids is not None and inputs_embeds is not None:
-             raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
-         elif input_ids is not None:
-             batch_size, seq_length = input_ids.shape
-         elif inputs_embeds is not None:
-             batch_size, seq_length, _ = inputs_embeds.shape
-         else:
-             raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
-
-         seq_length_with_past = seq_length
-         past_key_values_length = 0
-
-         if past_key_values is not None:
-             past_key_values_length = past_key_values[0][0].shape[2]
-             seq_length_with_past = seq_length_with_past + past_key_values_length
-
          if attention_mask is not None:
              attention_mask = attention_mask.bool()
-         else:
-             attention_mask = torch.ones(
-                 (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
-             )
-
-         if inputs_embeds is None:
-             tok_emb = self.wte(input_ids)
-         else:
-             tok_emb = inputs_embeds
-
          if prefix_mask is not None:
              prefix_mask = prefix_mask.bool()
          if not return_dict:
              raise NotImplementedError('return_dict False is not implemented yet for MPT')
          if output_attentions:
-             raise NotImplementedError('output_attentions is not implemented yet for MPT')
-         #if attention_mask is not None and attention_mask[:, 0].sum() != attention_mask.shape[0] and self.training:
-         #     raise NotImplementedError('MPT does not support training with left padding.')
          if self.prefix_lm and prefix_mask is None:
              raise ValueError('prefix_mask is a required argument when MPT is configured with prefix_lm=True.')
          if self.training:
              if self.attn_uses_sequence_id and sequence_id is None:
                  raise ValueError('sequence_id is a required argument when MPT is configured with attn_uses_sequence_id=True ' + 'and the model is in train mode.')
              elif self.attn_uses_sequence_id is False and sequence_id is not None:
                  warnings.warn('MPT received non-None input for `sequence_id` but is configured with attn_uses_sequence_id=False. ' + 'This input will be ignored. If you want the model to use `sequence_id`, set attn_uses_sequence_id to True.')
-         S = seq_length
          assert S <= self.config.max_seq_len, f'Cannot forward input with seq_len={S}, this model only supports seq_len<={self.config.max_seq_len}'
-         if self.alibi:
-             x = tok_emb
-         else:
              past_position = 0
              if past_key_values is not None:
                  if len(past_key_values) != self.config.n_layers:
                      raise ValueError(f'past_key_values must provide a past_key_value for each attention ' + f'layer in the network (len(past_key_values)={len(past_key_values)!r}; self.config.n_layers={self.config.n_layers!r}).')
                  past_position = past_key_values[0][0].size(1)
              if S + past_position > self.config.max_seq_len:
-                 raise ValueError(f'Cannot forward input with past sequence length {past_position} and current sequence length {S + 1}, this model only supports total sequence length <= {self.config.max_seq_len}.')
              pos = torch.arange(past_position, S + past_position, dtype=torch.long, device=input_ids.device).unsqueeze(0)
-             if attention_mask is not None and not self.training:
                  pos = torch.clamp(pos - torch.cumsum((~attention_mask).to(torch.int32), dim=1)[:, past_position:], min=0)
              pos_emb = self.wpe(pos)
              x = tok_emb + pos_emb
          if self.embedding_fraction == 1:
              x = self.emb_drop(x)
          else:
              x_shrunk = x * self.embedding_fraction + x.detach() * (1 - self.embedding_fraction)
              assert isinstance(self.emb_drop, nn.Module)
              x = self.emb_drop(x_shrunk)
-         (attn_bias, attention_mask) = self._attn_bias(device=x.device, dtype=x.dtype, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id)
          if use_cache and past_key_values is None:
              past_key_values = [() for _ in range(self.config.n_layers)]
-
          all_hidden_states = () if output_hidden_states else None
          for (b_idx, block) in enumerate(self.blocks):
              if output_hidden_states:
                  assert all_hidden_states is not None
                  all_hidden_states = all_hidden_states + (x,)
              past_key_value = past_key_values[b_idx] if past_key_values is not None else None
-
-             if self.gradient_checkpointing and self.training:
-
-                 def create_custom_forward(module):
-                     def custom_forward(*inputs):
-                         # None for past_key_value
-                         return module(*inputs)
-
-                     return custom_forward
-
-                 (x, past_key_value) = torch.utils.checkpoint.checkpoint(
-                     create_custom_forward(block),
-                     x,
-                     past_key_value,
-                     attn_bias,
-                     attention_mask,
-                     self.is_causal,
-                 )
-             else:
-                 (x, past_key_value) = block(x, past_key_value=past_key_value, attn_bias=attn_bias, attention_mask=attention_mask, is_causal=self.is_causal)
-
-             if past_key_values is not None:
-                 past_key_values[b_idx] = past_key_value
          x = self.norm_f(x)
-         return BaseModelOutputWithPast(last_hidden_state=x, past_key_values=past_key_values, hidden_states=all_hidden_states)

-     def param_init_fn(self, module):
          init_fn_name = self.config.init_config['name']
          MODEL_INIT_REGISTRY[init_fn_name](module=module, n_layers=self.config.n_layers, d_model=self.config.d_model, **self.config.init_config)

-     def fsdp_wrap_fn(self, module):
          return isinstance(module, MPTBlock)

-     def activation_checkpointing_fn(self, module):
          return isinstance(module, MPTBlock)

  class MPTForCausalLM(MPTPreTrainedModel):
@@ -258,7 +231,13 @@ class MPTForCausalLM(MPTPreTrainedModel):
          super().__init__(config)
          if not config.tie_word_embeddings:
              raise ValueError('MPTForCausalLM only supports tied word embeddings')
-         self.transformer = MPTModel(config)
          self.logit_scale = None
          if config.logit_scale is not None:
              logit_scale = config.logit_scale
@@ -269,51 +248,53 @@ class MPTForCausalLM(MPTPreTrainedModel):
                      raise ValueError(f"logit_scale={logit_scale!r} is not recognized as an option; use numeric value or 'inv_sqrt_d_model'.")
              self.logit_scale = logit_scale

-     def get_input_embeddings(self):
          return self.transformer.wte

-     def set_input_embeddings(self, value):
          self.transformer.wte = value

-     def get_output_embeddings(self):
          return self.transformer.wte

-     def set_output_embeddings(self, new_embeddings):
          self.transformer.wte = new_embeddings

-     def set_decoder(self, decoder):
          self.transformer = decoder

-     def get_decoder(self):
          return self.transformer

-     def forward(self, input_ids: torch.LongTensor, past_key_values: Optional[List[Tuple[torch.FloatTensor]]]=None, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None, labels: Optional[torch.LongTensor]=None, return_dict: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, use_cache: Optional[bool]=None, inputs_embeds: Optional[torch.FloatTensor] = None):
          return_dict = return_dict if return_dict is not None else self.config.return_dict
          use_cache = use_cache if use_cache is not None else self.config.use_cache
-         outputs = self.transformer(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id, return_dict=return_dict, output_attentions=output_attentions, output_hidden_states=output_hidden_states, use_cache=use_cache, inputs_embeds=inputs_embeds)
-         logits = F.linear(outputs.last_hidden_state, self.transformer.wte.weight)
          if self.logit_scale is not None:
              if self.logit_scale == 0:
                  warnings.warn(f'Multiplying logits by self.logit_scale={self.logit_scale!r}. This will produce uniform (uninformative) outputs.')
              logits *= self.logit_scale
          loss = None
          if labels is not None:
-             labels = torch.roll(labels, shifts=-1)
-             labels[:, -1] = -100
-             loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.to(logits.device).view(-1))
-         return CausalLMOutputWithPast(loss=loss, logits=logits, past_key_values=outputs.past_key_values, hidden_states=outputs.hidden_states)

-     def param_init_fn(self, module):
          init_fn_name = self.config.init_config['name']
          MODEL_INIT_REGISTRY[init_fn_name](module=module, n_layers=self.config.n_layers, d_model=self.config.d_model, **self.config.init_config)

-     def fsdp_wrap_fn(self, module):
          return isinstance(module, MPTBlock)

-     def activation_checkpointing_fn(self, module):
          return isinstance(module, MPTBlock)

-     def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
          if inputs_embeds is not None:
              raise NotImplementedError('inputs_embeds is not implemented for MPT yet')
          attention_mask = kwargs['attention_mask'].bool()
@@ -334,7 +315,7 @@ class MPTForCausalLM(MPTPreTrainedModel):
          return {'input_ids': input_ids, 'attention_mask': attention_mask, 'prefix_mask': prefix_mask, 'sequence_id': sequence_id, 'past_key_values': past_key_values, 'use_cache': kwargs.get('use_cache', True)}

      @staticmethod
-     def _reorder_cache(past_key_values, beam_idx):
          """Used by HuggingFace generate when using beam search with kv-caching.

          See https://github.com/huggingface/transformers/blob/3ec7a47664ebe40c40f4b722f6bb1cd30c3821ec/src/transformers/models/gpt2/modeling_gpt2.py#L1122-L1133
 
  """
  import math
  import warnings
+ from typing import Any, Dict, List, Mapping, MutableMapping, Optional, Tuple, Union
  import torch
  import torch.nn as nn
  import torch.nn.functional as F
+ from transformers import PreTrainedModel, PreTrainedTokenizerBase
  from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
  from .attention import attn_bias_shape, build_attn_bias
  from .blocks import MPTBlock
+ from .custom_embedding import SharedEmbedding
+ from .fc import FC_CLASS_REGISTRY as FC_CLASS_REGISTRY
+ from .ffn import FFN_CLASS_REGISTRY as FFN_CLASS_REGISTRY
+ from .ffn import MPTMLP as MPTMLP
+ from .ffn import build_ffn as build_ffn
  from .norm import NORM_CLASS_REGISTRY
  from .configuration_mpt import MPTConfig
  from .adapt_tokenizer import AutoTokenizerForMOD, adapt_tokenizer_for_denoising
  from .hf_prefixlm_converter import add_bidirectional_mask_if_missing, convert_hf_causal_lm_to_prefix_lm
  from .meta_init_context import init_empty_weights
+ from .param_init_fns import generic_param_init_fn_, MODEL_INIT_REGISTRY
+ try:
+     from .flash_attn_triton import flash_attn_func as flash_attn_func
+ except:
+     pass
+ import logging
+ log = logging.getLogger(__name__)

  class MPTPreTrainedModel(PreTrainedModel):
      config_class = MPTConfig
      base_model_prefix = 'model'
+     _no_split_modules = ['MPTBlock']

  class MPTModel(MPTPreTrainedModel):

      def __init__(self, config: MPTConfig):
          config._validate_config()
          super().__init__(config)
          self.attn_impl = config.attn_config['attn_impl']
          self.prefix_lm = config.attn_config['prefix_lm']
          self.attn_uses_sequence_id = config.attn_config['attn_uses_sequence_id']
          self.alibi = config.attn_config['alibi']
          self.alibi_bias_max = config.attn_config['alibi_bias_max']
+         self.learned_pos_emb = config.learned_pos_emb
+         if config.init_device == 'mixed':
+             if dist.get_local_rank() == 0:
+                 config.init_device = 'cpu'
+             else:
+                 config.init_device = 'meta'
          if config.norm_type.lower() not in NORM_CLASS_REGISTRY.keys():
              norm_options = ' | '.join(NORM_CLASS_REGISTRY.keys())
              raise NotImplementedError(f'Requested norm type ({config.norm_type}) is not implemented within this repo (Options: {norm_options}).')
          norm_class = NORM_CLASS_REGISTRY[config.norm_type.lower()]
          self.embedding_fraction = config.embedding_fraction
+         self.wte = SharedEmbedding(config.vocab_size, config.d_model, device=config.init_device)
+         if self.learned_pos_emb:
+             self.wpe = torch.nn.Embedding(config.max_seq_len, config.d_model, device=config.init_device)
          self.emb_drop = nn.Dropout(config.emb_pdrop)
          self.blocks = nn.ModuleList([MPTBlock(device=config.init_device, **config.to_dict()) for _ in range(config.n_layers)])
          self.norm_f = norm_class(config.d_model, device=config.init_device)
          if config.init_device != 'meta':
+             log.info(f'We recommend using config.init_device="meta" with Composer + FSDP for faster initialization.')
              self.apply(self.param_init_fn)
          self.is_causal = not self.prefix_lm
          self._attn_bias_initialized = False
 
          if config.no_bias:
              for module in self.modules():
                  if hasattr(module, 'bias') and isinstance(module.bias, nn.Parameter):
+                     log.info(f'Removing bias ({module.bias}) from {module}.')
                      module.register_parameter('bias', None)
+                 if hasattr(module, 'use_bias'):
+                     log.info(f'Setting use_bias=False for {module}.')
+                     module.use_bias = False
+         log.debug(self)
+         log.debug(f"Using {self.config.init_config['name']} initialization.")

+     def get_input_embeddings(self) -> nn.Embedding:
          return self.wte

+     def set_input_embeddings(self, value: nn.Embedding) -> None:
          self.wte = value

      @torch.no_grad()
+     def _attn_bias(self, device: torch.device, dtype: torch.dtype, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None) -> Tuple[Optional[torch.Tensor], Optional[torch.ByteTensor]]:
          if not self._attn_bias_initialized:
              if self.attn_bias_shape:
                  self.attn_bias = torch.zeros(self.attn_bias_shape, device=device, dtype=dtype)
 
              if attn_bias is None:
                  attn_bias = torch.zeros((1, 1, 1, s_k), device=device, dtype=dtype)
              else:
+                 _s_k = max(0, attn_bias.size(-1) - s_k)
+                 attn_bias = attn_bias[:, :, :, _s_k:]
              if prefix_mask is not None and attention_mask.shape != prefix_mask.shape:
                  raise ValueError(f'attention_mask shape={attention_mask.shape} ' + f'and prefix_mask shape={prefix_mask.shape} are not equal.')
              min_val = torch.finfo(attn_bias.dtype).min
              attn_bias = attn_bias.masked_fill(~attention_mask.view(-1, 1, 1, s_k), min_val)
          return (attn_bias, None)

+     def _apply_prefix_mask(self, attn_bias: torch.Tensor, prefix_mask: torch.Tensor) -> torch.Tensor:
          (s_k, s_q) = attn_bias.shape[-2:]
          if s_k != self.config.max_seq_len or s_q != self.config.max_seq_len:
              raise ValueError('attn_bias does not match the expected shape. ' + f'The last two dimensions should both be {self.config.max_length} ' + f'but are {s_k} and {s_q}.')
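Reviewer note on the slicing change in `_attn_bias`: a negative-index slice `[-s_k:]` selects the whole last dimension when `s_k` is zero, because `-0` is `0`. Computing an explicit start offset avoids that edge case. The sketch below uses a plain Python list as a stand-in for the last tensor dimension; `old_slice` and `new_slice` are hypothetical helper names, and whether `s_k == 0` actually occurs in practice is not stated in this diff.

```python
from typing import List


def old_slice(bias: List[float], s_k: int) -> List[float]:
    """Mirrors the removed form: bias[..., -s_k:]."""
    return bias[-s_k:]


def new_slice(bias: List[float], s_k: int) -> List[float]:
    """Mirrors the added form: start = max(0, size - s_k); bias[..., start:]."""
    start = max(0, len(bias) - s_k)
    return bias[start:]


bias = [0.0, -1.0, -2.0, -3.0]

# Ordinary case: both keep the last s_k entries.
same_for_two = old_slice(bias, 2) == new_slice(bias, 2)

# Edge case: s_k == 0 keeps everything in the old form, nothing in the new one.
old_at_zero = old_slice(bias, 0)
new_at_zero = new_slice(bias, 0)
```

For `s_k` larger than the bias length, both forms return the full list, so the two only differ at zero.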
 
          attn_bias = attn_bias.masked_fill(cannot_attend, min_val)
          return attn_bias

+     def _apply_sequence_id(self, attn_bias: torch.Tensor, sequence_id: torch.LongTensor) -> torch.Tensor:
          seq_len = sequence_id.shape[-1]
          if seq_len > self.config.max_seq_len:
              raise ValueError(f'sequence_id sequence length cannot exceed max_seq_len={self.config.max_seq_len}')
 
          attn_bias = attn_bias.masked_fill(cannot_attend, min_val)
          return attn_bias

+     def forward(self, input_ids: torch.LongTensor, past_key_values: Optional[List[Tuple[torch.FloatTensor]]]=None, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None, return_dict: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, use_cache: Optional[bool]=None, inputs_embeds: Optional[torch.Tensor]=None) -> BaseModelOutputWithPast:
          return_dict = return_dict if return_dict is not None else self.config.return_dict
          use_cache = use_cache if use_cache is not None else self.config.use_cache
          if attention_mask is not None:
              attention_mask = attention_mask.bool()
          if prefix_mask is not None:
              prefix_mask = prefix_mask.bool()
          if not return_dict:
              raise NotImplementedError('return_dict False is not implemented yet for MPT')
          if output_attentions:
+             if self.attn_impl != 'torch':
+                 raise NotImplementedError('output_attentions is not implemented for MPT when using attn_impl `flash` or `triton`.')
+         if self.training and attention_mask is not None and (attention_mask[:, 0].sum() != attention_mask.shape[0]):
+             raise NotImplementedError('MPT does not support training with left padding.')
          if self.prefix_lm and prefix_mask is None:
              raise ValueError('prefix_mask is a required argument when MPT is configured with prefix_lm=True.')
+         if inputs_embeds is not None:
+             raise NotImplementedError('inputs_embeds is not implemented for MPT.')
          if self.training:
              if self.attn_uses_sequence_id and sequence_id is None:
                  raise ValueError('sequence_id is a required argument when MPT is configured with attn_uses_sequence_id=True ' + 'and the model is in train mode.')
              elif self.attn_uses_sequence_id is False and sequence_id is not None:
                  warnings.warn('MPT received non-None input for `sequence_id` but is configured with attn_uses_sequence_id=False. ' + 'This input will be ignored. If you want the model to use `sequence_id`, set attn_uses_sequence_id to True.')
+         S = input_ids.size(1)
          assert S <= self.config.max_seq_len, f'Cannot forward input with seq_len={S}, this model only supports seq_len<={self.config.max_seq_len}'
+         tok_emb = self.wte(input_ids)
+         if self.learned_pos_emb:
              past_position = 0
              if past_key_values is not None:
                  if len(past_key_values) != self.config.n_layers:
                      raise ValueError(f'past_key_values must provide a past_key_value for each attention ' + f'layer in the network (len(past_key_values)={len(past_key_values)!r}; self.config.n_layers={self.config.n_layers!r}).')
                  past_position = past_key_values[0][0].size(1)
+                 if self.attn_impl == 'torch':
+                     past_position = past_key_values[0][0].size(3)
              if S + past_position > self.config.max_seq_len:
+                 raise ValueError(f'Cannot forward input with past sequence length {past_position} and current sequence length ' + f'{S + 1}, this model only supports total sequence length <= {self.config.max_seq_len}.')
              pos = torch.arange(past_position, S + past_position, dtype=torch.long, device=input_ids.device).unsqueeze(0)
+             if attention_mask is not None:
                  pos = torch.clamp(pos - torch.cumsum((~attention_mask).to(torch.int32), dim=1)[:, past_position:], min=0)
              pos_emb = self.wpe(pos)
              x = tok_emb + pos_emb
+         else:
+             x = tok_emb
          if self.embedding_fraction == 1:
              x = self.emb_drop(x)
          else:
              x_shrunk = x * self.embedding_fraction + x.detach() * (1 - self.embedding_fraction)
              assert isinstance(self.emb_drop, nn.Module)
              x = self.emb_drop(x_shrunk)
+         (attn_bias, attention_mask) = self._attn_bias(device=x.device, dtype=torch.float32, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id)
+         presents = () if use_cache else None
          if use_cache and past_key_values is None:
              past_key_values = [() for _ in range(self.config.n_layers)]
          all_hidden_states = () if output_hidden_states else None
+         all_self_attns = () if output_attentions else None
          for (b_idx, block) in enumerate(self.blocks):
              if output_hidden_states:
                  assert all_hidden_states is not None
                  all_hidden_states = all_hidden_states + (x,)
              past_key_value = past_key_values[b_idx] if past_key_values is not None else None
+             (x, attn_weights, present) = block(x, past_key_value=past_key_value, attn_bias=attn_bias, attention_mask=attention_mask, is_causal=self.is_causal, output_attentions=bool(output_attentions))
+             if presents is not None:
+                 presents += (present,)
+             if output_attentions:
+                 assert all_self_attns is not None
+                 all_self_attns = all_self_attns + (attn_weights,)
          x = self.norm_f(x)
+         if output_hidden_states:
+             assert all_hidden_states is not None
+             all_hidden_states = all_hidden_states + (x,)
+         return BaseModelOutputWithPast(last_hidden_state=x, past_key_values=presents, hidden_states=all_hidden_states, attentions=all_self_attns)

+     def param_init_fn(self, module: nn.Module) -> None:
          init_fn_name = self.config.init_config['name']
          MODEL_INIT_REGISTRY[init_fn_name](module=module, n_layers=self.config.n_layers, d_model=self.config.d_model, **self.config.init_config)

+     def fsdp_wrap_fn(self, module: nn.Module) -> bool:
          return isinstance(module, MPTBlock)

+     def activation_checkpointing_fn(self, module: nn.Module) -> bool:
          return isinstance(module, MPTBlock)

  class MPTForCausalLM(MPTPreTrainedModel):
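Reviewer note on the learned-position branch of `MPTModel.forward`: the `torch.clamp(pos - torch.cumsum(~attention_mask) ...)` expression subtracts the running count of padded slots from each raw position and clamps at zero, so left padding does not consume position indices. A plain-Python sketch of that arithmetic for a single sequence follows; the helper name `position_ids` is hypothetical, and the list-based mask is a stand-in for the boolean tensor.

```python
from itertools import accumulate
from typing import List


def position_ids(attention_mask: List[bool], past_position: int = 0) -> List[int]:
    """Positions for the tokens after `past_position`, skipping padded slots."""
    # Running count of padded (masked-out) slots, like cumsum(~attention_mask).
    pad_count = list(accumulate(0 if keep else 1 for keep in attention_mask))
    seq_len = len(attention_mask) - past_position
    raw = range(past_position, past_position + seq_len)
    # Subtract the padding seen so far and clamp at zero, like torch.clamp(min=0).
    return [max(p - c, 0) for p, c in zip(raw, pad_count[past_position:])]


# Two left-padding slots followed by three real tokens.
full = position_ids([False, False, True, True, True])

# Same mask during cached generation, where three slots are already in the cache.
tail = position_ids([False, False, True, True, True], past_position=3)
```

With this mask the first real token lands on position 0 rather than position 2, which is what the clamp-and-subtract expression is arranged to produce.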
 
          super().__init__(config)
          if not config.tie_word_embeddings:
              raise ValueError('MPTForCausalLM only supports tied word embeddings')
+         log.info(f'Instantiating an MPTForCausalLM model from {__file__}')
+         self.transformer: MPTModel = MPTModel(config)
+         for child in self.transformer.children():
+             if isinstance(child, torch.nn.ModuleList):
+                 continue
+             if isinstance(child, torch.nn.Module):
+                 child._fsdp_wrap = True
          self.logit_scale = None
          if config.logit_scale is not None:
              logit_scale = config.logit_scale
 
                      raise ValueError(f"logit_scale={logit_scale!r} is not recognized as an option; use numeric value or 'inv_sqrt_d_model'.")
              self.logit_scale = logit_scale

+     def get_input_embeddings(self) -> nn.Embedding:
          return self.transformer.wte

+     def set_input_embeddings(self, value: Union[SharedEmbedding, nn.Embedding]) -> None:
          self.transformer.wte = value

+     def get_output_embeddings(self) -> nn.Embedding:
          return self.transformer.wte

+     def set_output_embeddings(self, new_embeddings: Union[SharedEmbedding, nn.Embedding]) -> None:
          self.transformer.wte = new_embeddings

+     def set_decoder(self, decoder: MPTModel) -> None:
          self.transformer = decoder

+     def get_decoder(self) -> MPTModel:
          return self.transformer

+     def forward(self, input_ids: torch.LongTensor, past_key_values: Optional[List[Tuple[torch.FloatTensor]]]=None, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None, labels: Optional[torch.LongTensor]=None, return_dict: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, use_cache: Optional[bool]=None, inputs_embeds: Optional[torch.FloatTensor]=None) -> CausalLMOutputWithPast:
          return_dict = return_dict if return_dict is not None else self.config.return_dict
          use_cache = use_cache if use_cache is not None else self.config.use_cache
+         if inputs_embeds is not None:
+             raise NotImplementedError('inputs_embeds has to be None (for hf/peft support).')
+         outputs = self.transformer(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id, return_dict=return_dict, output_attentions=output_attentions, output_hidden_states=output_hidden_states, use_cache=use_cache)
+         logits = self.transformer.wte(outputs.last_hidden_state.to(self.transformer.wte.weight.device), True)
          if self.logit_scale is not None:
              if self.logit_scale == 0:
                  warnings.warn(f'Multiplying logits by self.logit_scale={self.logit_scale!r}. This will produce uniform (uninformative) outputs.')
              logits *= self.logit_scale
          loss = None
          if labels is not None:
+             _labels = torch.roll(labels, shifts=-1)
+             _labels[:, -1] = -100
+             loss = F.cross_entropy(logits.view(-1, logits.size(-1)), _labels.to(logits.device).view(-1))
+         return CausalLMOutputWithPast(loss=loss, logits=logits, past_key_values=outputs.past_key_values, hidden_states=outputs.hidden_states, attentions=outputs.attentions)

+     def param_init_fn(self, module: nn.Module) -> None:
          init_fn_name = self.config.init_config['name']
          MODEL_INIT_REGISTRY[init_fn_name](module=module, n_layers=self.config.n_layers, d_model=self.config.d_model, **self.config.init_config)

+     def fsdp_wrap_fn(self, module: nn.Module) -> bool:
          return isinstance(module, MPTBlock)

+     def activation_checkpointing_fn(self, module: nn.Module) -> bool:
          return isinstance(module, MPTBlock)

+     def prepare_inputs_for_generation(self, input_ids: torch.Tensor, past_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]]=None, inputs_embeds: Optional[torch.Tensor]=None, **kwargs: Any) -> Dict[str, Any]:
          if inputs_embeds is not None:
              raise NotImplementedError('inputs_embeds is not implemented for MPT yet')
          attention_mask = kwargs['attention_mask'].bool()
 
315
  return {'input_ids': input_ids, 'attention_mask': attention_mask, 'prefix_mask': prefix_mask, 'sequence_id': sequence_id, 'past_key_values': past_key_values, 'use_cache': kwargs.get('use_cache', True)}
316
 
317
  @staticmethod
318
+ def _reorder_cache(past_key_values: List[Tuple[torch.Tensor, torch.Tensor]], beam_idx: torch.LongTensor) -> List[Tuple[torch.Tensor, ...]]:
319
  """Used by HuggingFace generate when using beam search with kv-caching.
320
 
321
  See https://github.com/huggingface/transformers/blob/3ec7a47664ebe40c40f4b722f6bb1cd30c3821ec/src/transformers/models/gpt2/modeling_gpt2.py#L1122-L1133
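Note on the loss lines added to `forward` in the hunk above: the labels are rolled left by one position and the last column is set to `-100`, so each position is scored against the next token and the final position is skipped (`-100` is the default `ignore_index` of `torch.nn.functional.cross_entropy`). The following is a minimal sketch of that shift in plain Python on nested lists, so it runs without torch; `shift_labels` and `IGNORE_INDEX` are illustrative names and do not appear in the repository.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def shift_labels(labels):
    """Return next-token targets for each row of token ids.

    Mirrors `torch.roll(labels, shifts=-1)` followed by
    `labels[:, -1] = -100` for a 2-D batch: every position is paired
    with the token that follows it, and the last position is ignored.
    """
    return [list(row[1:]) + [IGNORE_INDEX] for row in labels]


labels = [[10, 11, 12, 13], [20, 21, 22, 23]]
shifted = shift_labels(labels)
print(shifted)
```

Because the roll in the diff is applied without a `dims` argument, the last element of each row briefly holds the first token of the next row; overwriting the last column with `-100` is what keeps rows from leaking into each other.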
norm.py CHANGED
@@ -1,6 +1,7 @@
 
1
  import torch
2
 
3
- def _cast_if_autocast_enabled(tensor):
4
  if torch.is_autocast_enabled():
5
  if tensor.device.type == 'cuda':
6
  dtype = torch.get_autocast_gpu_dtype()
@@ -13,10 +14,10 @@ def _cast_if_autocast_enabled(tensor):
13
 
14
  class LPLayerNorm(torch.nn.LayerNorm):
15
 
16
- def __init__(self, normalized_shape, eps=1e-05, elementwise_affine=True, device=None, dtype=None):
17
  super().__init__(normalized_shape=normalized_shape, eps=eps, elementwise_affine=elementwise_affine, device=device, dtype=dtype)
18
 
19
- def forward(self, x):
20
  module_device = x.device
21
  downcast_x = _cast_if_autocast_enabled(x)
22
  downcast_weight = _cast_if_autocast_enabled(self.weight) if self.weight is not None else self.weight
@@ -24,15 +25,15 @@ class LPLayerNorm(torch.nn.LayerNorm):
24
  with torch.autocast(enabled=False, device_type=module_device.type):
25
  return torch.nn.functional.layer_norm(downcast_x, self.normalized_shape, downcast_weight, downcast_bias, self.eps)
26
 
27
- def rms_norm(x, weight=None, eps=1e-05):
28
- output = x / torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
29
  if weight is not None:
30
  return output * weight
31
  return output
32
 
33
  class RMSNorm(torch.nn.Module):
34
 
35
- def __init__(self, normalized_shape, eps=1e-05, weight=True, dtype=None, device=None):
36
  super().__init__()
37
  self.eps = eps
38
  if weight:
@@ -40,17 +41,17 @@ class RMSNorm(torch.nn.Module):
40
  else:
41
  self.register_parameter('weight', None)
42
 
43
- def forward(self, x):
44
  return rms_norm(x.float(), self.weight, self.eps).to(dtype=x.dtype)
45
 
46
  class LPRMSNorm(RMSNorm):
47
 
48
- def __init__(self, normalized_shape, eps=1e-05, weight=True, dtype=None, device=None):
49
  super().__init__(normalized_shape=normalized_shape, eps=eps, weight=weight, dtype=dtype, device=device)
50
 
51
- def forward(self, x):
52
  downcast_x = _cast_if_autocast_enabled(x)
53
  downcast_weight = _cast_if_autocast_enabled(self.weight) if self.weight is not None else self.weight
54
  with torch.autocast(enabled=False, device_type=x.device.type):
55
  return rms_norm(downcast_x, downcast_weight, self.eps).to(dtype=x.dtype)
56
- NORM_CLASS_REGISTRY = {'layernorm': torch.nn.LayerNorm, 'low_precision_layernorm': LPLayerNorm, 'rmsnorm': RMSNorm, 'low_precision_rmsnorm': LPRMSNorm}
 
1
+ from typing import Dict, List, Optional, Type, Union
2
  import torch
3
 
4
+ def _cast_if_autocast_enabled(tensor: torch.Tensor) -> torch.Tensor:
5
  if torch.is_autocast_enabled():
6
  if tensor.device.type == 'cuda':
7
  dtype = torch.get_autocast_gpu_dtype()
 
14
 
15
  class LPLayerNorm(torch.nn.LayerNorm):
16
 
17
+ def __init__(self, normalized_shape: Union[int, List[int], torch.Size], eps: float=1e-05, elementwise_affine: bool=True, device: Optional[torch.device]=None, dtype: Optional[torch.dtype]=None):
18
  super().__init__(normalized_shape=normalized_shape, eps=eps, elementwise_affine=elementwise_affine, device=device, dtype=dtype)
19
 
20
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
21
  module_device = x.device
22
  downcast_x = _cast_if_autocast_enabled(x)
23
  downcast_weight = _cast_if_autocast_enabled(self.weight) if self.weight is not None else self.weight
 
25
  with torch.autocast(enabled=False, device_type=module_device.type):
26
  return torch.nn.functional.layer_norm(downcast_x, self.normalized_shape, downcast_weight, downcast_bias, self.eps)
27
 
28
+ def rms_norm(x: torch.Tensor, weight: Optional[torch.Tensor]=None, eps: float=1e-05) -> torch.Tensor:
29
+ output = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
30
  if weight is not None:
31
  return output * weight
32
  return output
33
 
34
  class RMSNorm(torch.nn.Module):
35
 
36
+ def __init__(self, normalized_shape: Union[int, List[int], torch.Size], eps: float=1e-05, weight: bool=True, dtype: Optional[torch.dtype]=None, device: Optional[torch.device]=None):
37
  super().__init__()
38
  self.eps = eps
39
  if weight:
 
41
  else:
42
  self.register_parameter('weight', None)
43
 
44
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
45
  return rms_norm(x.float(), self.weight, self.eps).to(dtype=x.dtype)
46
 
47
  class LPRMSNorm(RMSNorm):
48
 
49
+ def __init__(self, normalized_shape: Union[int, List[int], torch.Size], eps: float=1e-05, weight: bool=True, dtype: Optional[torch.dtype]=None, device: Optional[torch.device]=None):
50
  super().__init__(normalized_shape=normalized_shape, eps=eps, weight=weight, dtype=dtype, device=device)
51
 
52
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
53
  downcast_x = _cast_if_autocast_enabled(x)
54
  downcast_weight = _cast_if_autocast_enabled(self.weight) if self.weight is not None else self.weight
55
  with torch.autocast(enabled=False, device_type=x.device.type):
56
  return rms_norm(downcast_x, downcast_weight, self.eps).to(dtype=x.dtype)
57
+ NORM_CLASS_REGISTRY: Dict[str, Type[torch.nn.Module]] = {'layernorm': torch.nn.LayerNorm, 'low_precision_layernorm': LPLayerNorm, 'rmsnorm': RMSNorm, 'low_precision_rmsnorm': LPRMSNorm}
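Note on the `rms_norm` change in this file: apart from the added type annotations, the one behavioural change is that the removed line divides `x` by `torch.rsqrt(...)` while the added line multiplies by it. Dividing by the reciprocal square root scales the input up by its root mean square instead of normalizing it. The sketch below illustrates the difference in plain Python on a single vector so it runs without torch; the helper names (`rms_norm_fixed`, `rms_norm_old`, `rms`) are illustrative and are not part of the repository.

```python
import math


def _inv_rms(x, eps=1e-5):
    # Plain-Python stand-in for torch.rsqrt(x.pow(2).mean(-1) + eps).
    mean_sq = sum(v * v for v in x) / len(x)
    return 1.0 / math.sqrt(mean_sq + eps)


def rms_norm_fixed(x, eps=1e-5):
    # Added line: x * rsqrt(mean(x^2) + eps), output RMS is close to 1.
    scale = _inv_rms(x, eps)
    return [v * scale for v in x]


def rms_norm_old(x, eps=1e-5):
    # Removed line: x / rsqrt(mean(x^2) + eps), output is scaled UP by the RMS.
    scale = _inv_rms(x, eps)
    return [v / scale for v in x]


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


x = [3.0, 4.0, 0.0, 0.0]  # RMS = sqrt(25 / 4) = 2.5
fixed_rms = rms(rms_norm_fixed(x))  # close to 1.0
old_rms = rms(rms_norm_old(x))      # close to 2.5 ** 2 = 6.25
print(fixed_rms, old_rms)
```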
param_init_fns.py CHANGED
@@ -2,22 +2,26 @@ import math
2
  import warnings
3
  from collections.abc import Sequence
4
  from functools import partial
5
- from typing import Optional, Tuple, Union
6
  import torch
7
  from torch import nn
 
8
  from .norm import NORM_CLASS_REGISTRY
 
 
 
 
9
 
10
- def torch_default_param_init_fn_(module: nn.Module, verbose: int=0, **kwargs):
11
  del kwargs
12
- if verbose > 1:
13
- warnings.warn(f"Initializing network using module's reset_parameters attribute")
14
- if hasattr(module, 'reset_parameters'):
15
  module.reset_parameters()
16
 
17
- def fused_init_helper_(module: nn.Module, init_fn_):
18
  _fused = getattr(module, '_fused', None)
19
  if _fused is None:
20
  raise RuntimeError(f'Internal logic error')
 
21
  (dim, splits) = _fused
22
  splits = (0, *splits, module.weight.size(dim))
23
  for (s, e) in zip(splits[:-1], splits[1:]):
@@ -25,10 +29,8 @@ def fused_init_helper_(module: nn.Module, init_fn_):
25
  slice_indices[dim] = slice(s, e)
26
  init_fn_(module.weight[slice_indices])
27
 
28
- def generic_param_init_fn_(module: nn.Module, init_fn_, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
29
  del kwargs
30
- if verbose > 1:
31
- warnings.warn(f'If model has bias parameters they are initialized to 0.')
32
  init_div_is_residual = init_div_is_residual
33
  if init_div_is_residual is False:
34
  div_is_residual = 1.0
@@ -36,20 +38,18 @@ def generic_param_init_fn_(module: nn.Module, init_fn_, n_layers: int, d_model:
36
  div_is_residual = math.sqrt(2 * n_layers)
37
  elif isinstance(init_div_is_residual, float) or isinstance(init_div_is_residual, int):
38
  div_is_residual = init_div_is_residual
39
- elif isinstance(init_div_is_residual, str) and init_div_is_residual.isnumeric():
40
  div_is_residual = float(init_div_is_residual)
41
  else:
42
  div_is_residual = 1.0
43
  raise ValueError(f'Expected init_div_is_residual to be boolean or numeric, got {init_div_is_residual}')
44
- if init_div_is_residual is not False:
45
- if verbose > 1:
46
- warnings.warn(f'Initializing _is_residual layers then dividing them by {div_is_residual:.3f}. ' + f'Set `init_div_is_residual: false` in init config to disable this.')
47
- if isinstance(module, nn.Linear):
48
  if hasattr(module, '_fused'):
49
  fused_init_helper_(module, init_fn_)
50
  else:
51
  init_fn_(module.weight)
52
  if module.bias is not None:
 
53
  torch.nn.init.zeros_(module.bias)
54
  if init_div_is_residual is not False and getattr(module, '_is_residual', False):
55
  with torch.no_grad():
@@ -60,8 +60,6 @@ def generic_param_init_fn_(module: nn.Module, init_fn_, n_layers: int, d_model:
60
  if std == 0:
61
  warnings.warn(f'Embedding layer initialized to 0.')
62
  emb_init_fn_ = partial(torch.nn.init.normal_, mean=0.0, std=std)
63
- if verbose > 1:
64
- warnings.warn(f'Embedding layer initialized using normal distribution with mean=0 and std={std!r}.')
65
  elif emb_init_uniform_lim is not None:
66
  lim = emb_init_uniform_lim
67
  if isinstance(lim, Sequence):
@@ -75,17 +73,13 @@ def generic_param_init_fn_(module: nn.Module, init_fn_, n_layers: int, d_model:
75
  lim = [-lim, lim]
76
  (a, b) = lim
77
  emb_init_fn_ = partial(torch.nn.init.uniform_, a=a, b=b)
78
- if verbose > 1:
79
- warnings.warn(f'Embedding layer initialized using uniform distribution in range {lim}.')
80
  else:
81
  emb_init_fn_ = init_fn_
82
  emb_init_fn_(module.weight)
83
  elif isinstance(module, tuple(set(NORM_CLASS_REGISTRY.values()))):
84
- if verbose > 1:
85
- warnings.warn(f'Norm weights are set to 1. If norm layer has a bias it is initialized to 0.')
86
- if hasattr(module, 'weight') and module.weight is not None:
87
  torch.nn.init.ones_(module.weight)
88
- if hasattr(module, 'bias') and module.bias is not None:
89
  torch.nn.init.zeros_(module.bias)
90
  elif isinstance(module, nn.MultiheadAttention):
91
  if module._qkv_same_embed_dim:
@@ -114,32 +108,45 @@ def generic_param_init_fn_(module: nn.Module, init_fn_, n_layers: int, d_model:
114
  module.out_proj.weight.div_(div_is_residual)
115
  if module.out_proj.bias is not None:
116
  torch.nn.init.zeros_(module.out_proj.bias)
117
  else:
118
  for _ in module.parameters(recurse=False):
119
  raise NotImplementedError(f'{module.__class__.__name__} parameters are not initialized by param_init_fn.')
120
 
121
- def _normal_init_(std, mean=0.0):
122
  return partial(torch.nn.init.normal_, mean=mean, std=std)
123
 
124
- def _normal_param_init_fn_(module: nn.Module, std: float, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
125
  del kwargs
126
  init_fn_ = _normal_init_(std=std)
127
- if verbose > 1:
128
- warnings.warn(f'Using torch.nn.init.normal_ init fn mean=0.0, std={std}')
129
- generic_param_init_fn_(module=module, init_fn_=init_fn_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
130
 
131
- def baseline_param_init_fn_(module: nn.Module, init_std: float, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
132
  del kwargs
133
  if init_std is None:
134
  raise ValueError("You must set model.init_config['init_std'] to a float value to use the default initialization scheme.")
135
- _normal_param_init_fn_(module=module, std=init_std, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
136
 
137
- def small_param_init_fn_(module: nn.Module, n_layers: int, d_model: int, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
138
  del kwargs
139
  std = math.sqrt(2 / (5 * d_model))
140
- _normal_param_init_fn_(module=module, std=std, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
141
 
142
- def neox_param_init_fn_(module: nn.Module, n_layers: int, d_model: int, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
143
  """From section 2.3.1 of GPT-NeoX-20B:
144
 
145
  An Open-Source AutoregressiveLanguage Model — Black et. al. (2022)
@@ -148,34 +155,25 @@ def neox_param_init_fn_(module: nn.Module, n_layers: int, d_model: int, emb_init
148
  """
149
  del kwargs
150
  residual_div = n_layers / math.sqrt(10)
151
- if verbose > 1:
152
- warnings.warn(f'setting init_div_is_residual to {residual_div}')
153
- small_param_init_fn_(module=module, d_model=d_model, n_layers=n_layers, init_div_is_residual=residual_div, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
154
 
155
- def kaiming_uniform_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, fan_mode: str='fan_in', init_nonlinearity: str='leaky_relu', verbose: int=0, **kwargs):
156
  del kwargs
157
- if verbose > 1:
158
- warnings.warn(f'Using nn.init.kaiming_uniform_ init fn with parameters: ' + f'a={init_gain}, mode={fan_mode}, nonlinearity={init_nonlinearity}')
159
  kaiming_uniform_ = partial(nn.init.kaiming_uniform_, a=init_gain, mode=fan_mode, nonlinearity=init_nonlinearity)
160
- generic_param_init_fn_(module=module, init_fn_=kaiming_uniform_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
161
 
162
- def kaiming_normal_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, fan_mode: str='fan_in', init_nonlinearity: str='leaky_relu', verbose: int=0, **kwargs):
163
  del kwargs
164
- if verbose > 1:
165
- warnings.warn(f'Using nn.init.kaiming_normal_ init fn with parameters: ' + f'a={init_gain}, mode={fan_mode}, nonlinearity={init_nonlinearity}')
166
  kaiming_normal_ = partial(torch.nn.init.kaiming_normal_, a=init_gain, mode=fan_mode, nonlinearity=init_nonlinearity)
167
- generic_param_init_fn_(module=module, init_fn_=kaiming_normal_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
168
 
169
- def xavier_uniform_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, verbose: int=0, **kwargs):
170
  del kwargs
171
  xavier_uniform_ = partial(torch.nn.init.xavier_uniform_, gain=init_gain)
172
- if verbose > 1:
173
- warnings.warn(f'Using torch.nn.init.xavier_uniform_ init fn with parameters: ' + f'gain={init_gain}')
174
- generic_param_init_fn_(module=module, init_fn_=xavier_uniform_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
175
 
176
- def xavier_normal_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, verbose: int=0, **kwargs):
 
177
  xavier_normal_ = partial(torch.nn.init.xavier_normal_, gain=init_gain)
178
- if verbose > 1:
179
- warnings.warn(f'Using torch.nn.init.xavier_normal_ init fn with parameters: ' + f'gain={init_gain}')
180
- generic_param_init_fn_(module=module, init_fn_=xavier_normal_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
181
  MODEL_INIT_REGISTRY = {'default_': torch_default_param_init_fn_, 'baseline_': baseline_param_init_fn_, 'kaiming_uniform_': kaiming_uniform_param_init_fn_, 'kaiming_normal_': kaiming_normal_param_init_fn_, 'neox_init_': neox_param_init_fn_, 'small_init_': small_param_init_fn_, 'xavier_uniform_': xavier_uniform_param_init_fn_, 'xavier_normal_': xavier_normal_param_init_fn_}
 
2
  import warnings
3
  from collections.abc import Sequence
4
  from functools import partial
5
+ from typing import Any, Callable, Optional, Tuple, Union
6
  import torch
7
  from torch import nn
8
+ from .fc import FC_CLASS_REGISTRY
9
  from .norm import NORM_CLASS_REGISTRY
10
+ try:
11
+ import transformer_engine.pytorch as te
12
+ except:
13
+ te = None
14
 
15
+ def torch_default_param_init_fn_(module: nn.Module, **kwargs: Any) -> None:
16
  del kwargs
17
+ if hasattr(module, 'reset_parameters') and isinstance(module.reset_parameters, Callable):
 
 
18
  module.reset_parameters()
19
 
20
+ def fused_init_helper_(module: nn.Module, init_fn_: Callable) -> None:
21
  _fused = getattr(module, '_fused', None)
22
  if _fused is None:
23
  raise RuntimeError(f'Internal logic error')
24
+ assert isinstance(module.weight, torch.Tensor)
25
  (dim, splits) = _fused
26
  splits = (0, *splits, module.weight.size(dim))
27
  for (s, e) in zip(splits[:-1], splits[1:]):
 
29
  slice_indices[dim] = slice(s, e)
30
  init_fn_(module.weight[slice_indices])
31
 
32
+ def generic_param_init_fn_(module: nn.Module, init_fn_: Callable, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, **kwargs: Any) -> None:
33
  del kwargs
 
 
34
  init_div_is_residual = init_div_is_residual
35
  if init_div_is_residual is False:
36
  div_is_residual = 1.0
 
38
  div_is_residual = math.sqrt(2 * n_layers)
39
  elif isinstance(init_div_is_residual, float) or isinstance(init_div_is_residual, int):
40
  div_is_residual = init_div_is_residual
41
+ elif init_div_is_residual.isnumeric():
42
  div_is_residual = float(init_div_is_residual)
43
  else:
44
  div_is_residual = 1.0
45
  raise ValueError(f'Expected init_div_is_residual to be boolean or numeric, got {init_div_is_residual}')
46
+ if isinstance(module, tuple(set(FC_CLASS_REGISTRY.values()))):
 
 
 
47
  if hasattr(module, '_fused'):
48
  fused_init_helper_(module, init_fn_)
49
  else:
50
  init_fn_(module.weight)
51
  if module.bias is not None:
52
+ assert isinstance(module.bias, torch.Tensor)
53
  torch.nn.init.zeros_(module.bias)
54
  if init_div_is_residual is not False and getattr(module, '_is_residual', False):
55
  with torch.no_grad():
 
60
  if std == 0:
61
  warnings.warn(f'Embedding layer initialized to 0.')
62
  emb_init_fn_ = partial(torch.nn.init.normal_, mean=0.0, std=std)
 
 
63
  elif emb_init_uniform_lim is not None:
64
  lim = emb_init_uniform_lim
65
  if isinstance(lim, Sequence):
 
73
  lim = [-lim, lim]
74
  (a, b) = lim
75
  emb_init_fn_ = partial(torch.nn.init.uniform_, a=a, b=b)
 
 
76
  else:
77
  emb_init_fn_ = init_fn_
78
  emb_init_fn_(module.weight)
79
  elif isinstance(module, tuple(set(NORM_CLASS_REGISTRY.values()))):
80
+ if hasattr(module, 'weight') and isinstance(module.weight, torch.Tensor):
 
 
81
  torch.nn.init.ones_(module.weight)
82
+ if hasattr(module, 'bias') and isinstance(module.bias, torch.Tensor):
83
  torch.nn.init.zeros_(module.bias)
84
  elif isinstance(module, nn.MultiheadAttention):
85
  if module._qkv_same_embed_dim:
 
108
  module.out_proj.weight.div_(div_is_residual)
109
  if module.out_proj.bias is not None:
110
  torch.nn.init.zeros_(module.out_proj.bias)
111
+ elif te is not None and isinstance(module, te.LayerNormMLP):
112
+ if isinstance(module.layer_norm_weight, torch.Tensor):
113
+ torch.nn.init.ones_(module.layer_norm_weight)
114
+ if isinstance(module.layer_norm_bias, torch.Tensor):
115
+ torch.nn.init.zeros_(module.layer_norm_bias)
116
+ init_fn_(module.fc1_weight)
117
+ if module.fc1_bias is not None:
118
+ assert isinstance(module.fc1_bias, torch.Tensor)
119
+ torch.nn.init.zeros_(module.fc1_bias)
120
+ init_fn_(module.fc2_weight)
121
+ if module.fc2_bias is not None:
122
+ assert isinstance(module.fc2_bias, torch.Tensor)
123
+ torch.nn.init.zeros_(module.fc2_bias)
124
+ with torch.no_grad():
125
+ module.fc2_weight.div_(div_is_residual)
126
  else:
127
  for _ in module.parameters(recurse=False):
128
  raise NotImplementedError(f'{module.__class__.__name__} parameters are not initialized by param_init_fn.')
129
 
130
+ def _normal_init_(std: float, mean: float=0.0) -> Callable:
131
  return partial(torch.nn.init.normal_, mean=mean, std=std)
132
 
133
+ def _normal_param_init_fn_(module: nn.Module, std: float, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, **kwargs: Any) -> None:
134
  del kwargs
135
  init_fn_ = _normal_init_(std=std)
136
+ generic_param_init_fn_(module=module, init_fn_=init_fn_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
 
 
137
 
138
+ def baseline_param_init_fn_(module: nn.Module, init_std: Optional[float], n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, **kwargs: Any) -> None:
139
  del kwargs
140
  if init_std is None:
141
  raise ValueError("You must set model.init_config['init_std'] to a float value to use the default initialization scheme.")
142
+ _normal_param_init_fn_(module=module, std=init_std, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
143
 
144
+ def small_param_init_fn_(module: nn.Module, n_layers: int, d_model: int, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, **kwargs: Any) -> None:
145
  del kwargs
146
  std = math.sqrt(2 / (5 * d_model))
147
+ _normal_param_init_fn_(module=module, std=std, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
148
 
149
+ def neox_param_init_fn_(module: nn.Module, n_layers: int, d_model: int, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, **kwargs: Any) -> None:
150
  """From section 2.3.1 of GPT-NeoX-20B:
151
 
152
  An Open-Source AutoregressiveLanguage Model — Black et. al. (2022)
 
155
  """
156
  del kwargs
157
  residual_div = n_layers / math.sqrt(10)
158
+ small_param_init_fn_(module=module, d_model=d_model, n_layers=n_layers, init_div_is_residual=residual_div, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
 
 
159
 
160
+ def kaiming_uniform_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, fan_mode: str='fan_in', init_nonlinearity: str='leaky_relu', **kwargs: Any) -> None:
161
  del kwargs
 
 
162
  kaiming_uniform_ = partial(nn.init.kaiming_uniform_, a=init_gain, mode=fan_mode, nonlinearity=init_nonlinearity)
163
+ generic_param_init_fn_(module=module, init_fn_=kaiming_uniform_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
164
 
165
+ def kaiming_normal_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, fan_mode: str='fan_in', init_nonlinearity: str='leaky_relu', **kwargs: Any) -> None:
166
  del kwargs
 
 
167
  kaiming_normal_ = partial(torch.nn.init.kaiming_normal_, a=init_gain, mode=fan_mode, nonlinearity=init_nonlinearity)
168
+ generic_param_init_fn_(module=module, init_fn_=kaiming_normal_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
169
 
170
+ def xavier_uniform_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, **kwargs: Any) -> None:
171
  del kwargs
172
  xavier_uniform_ = partial(torch.nn.init.xavier_uniform_, gain=init_gain)
173
+ generic_param_init_fn_(module=module, init_fn_=xavier_uniform_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
 
 
174
 
175
+ def xavier_normal_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, **kwargs: Any) -> None:
176
+ del kwargs
177
  xavier_normal_ = partial(torch.nn.init.xavier_normal_, gain=init_gain)
178
+ generic_param_init_fn_(module=module, init_fn_=xavier_normal_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
 
 
179
  MODEL_INIT_REGISTRY = {'default_': torch_default_param_init_fn_, 'baseline_': baseline_param_init_fn_, 'kaiming_uniform_': kaiming_uniform_param_init_fn_, 'kaiming_normal_': kaiming_normal_param_init_fn_, 'neox_init_': neox_param_init_fn_, 'small_init_': small_param_init_fn_, 'xavier_uniform_': xavier_uniform_param_init_fn_, 'xavier_normal_': xavier_normal_param_init_fn_}
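Note on the scaling constants in this file: `generic_param_init_fn_` resolves `init_div_is_residual` into a divisor applied to residual-branch weights, `small_param_init_fn_` uses `std = sqrt(2 / (5 * d_model))`, and `neox_param_init_fn_` uses `n_layers / sqrt(10)` as the divisor. The sketch below reproduces only that arithmetic in plain Python. The function names are illustrative, the `is True` branch is an assumption (its condition falls in a part of the file the diff does not show), and the values 32 and 4096 are example inputs rather than figures taken from this commit.

```python
import math


def resolve_div_is_residual(init_div_is_residual, n_layers):
    # Booleans are checked first: bool is a subclass of int in Python,
    # so True would otherwise be treated as the number 1.
    if init_div_is_residual is False:
        return 1.0
    if init_div_is_residual is True:  # assumed condition, elided in the diff
        return math.sqrt(2 * n_layers)
    if isinstance(init_div_is_residual, (int, float)):
        return init_div_is_residual
    # str.isnumeric() accepts digit strings such as "4" but not "1.5".
    if isinstance(init_div_is_residual, str) and init_div_is_residual.isnumeric():
        return float(init_div_is_residual)
    raise ValueError(
        f"Expected init_div_is_residual to be boolean or numeric, got {init_div_is_residual}"
    )


def small_init_std(d_model):
    # Same expression as in small_param_init_fn_.
    return math.sqrt(2 / (5 * d_model))


def neox_residual_div(n_layers):
    # Same expression as in neox_param_init_fn_.
    return n_layers / math.sqrt(10)


div_true = resolve_div_is_residual(True, n_layers=32)    # sqrt(64) = 8.0
div_false = resolve_div_is_residual(False, n_layers=32)  # 1.0
div_str = resolve_div_is_residual("4", n_layers=32)      # 4.0
std_small = small_init_std(4096)
div_neox = neox_residual_div(32)
print(div_true, div_false, div_str, std_small, div_neox)
```

One consequence worth keeping in mind: a string such as `"1.5"` is rejected with `ValueError`, because `str.isnumeric()` is false for strings containing a decimal point.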
requirements.txt ADDED
@@ -0,0 +1,2 @@
1
+ einops==0.5.0
2
+ triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir_sm90#subdirectory=python