Model Card describes input size as 256 but Tokenizer is using 512 & output names of model changed
About two weeks ago, a pull request appears to have changed the model's model_max_length from 256 to 512. However, the model card still states "By default, input text longer than 256 word pieces is truncated" under the "Intended Use" section, and the Sentence Transformers documentation page (https://www.sbert.net/docs/sentence_transformer/pretrained_models.html) also lists the model's max sequence length as 256. I'm loading the tokenizer with AutoTokenizer.from_pretrained(), and it currently outputs up to 512 tokens, so I'm wondering why this is when the model seems designed for 256?
Also, when I previously loaded the model using AutoModel.from_pretrained(), the outputs it produced were named something like token_embeddings & sentence_embedding. Now, when I run the model, the outputs come out as last_hidden_state & pooler_output. Did something change upstream that all-MiniLM-L6-v2 inherited?
I believe last_hidden_state holds the token embeddings, and you can either apply mean pooling to it to get sentence embeddings or use pooler_output as the sentence embedding. The two give similar orderings when ranking by similarity score, but I'm not sure when one should be used over the other.
@philiphartmankt My confusion there is that the output of applying the mean pooling to the last_hidden_state output does not equal the pooler_output produced by the model, so I don't understand their different sources/uses
so I'm wondering why this is when the model seems designed for 256?
The reasoning is that the training data was shorter than 256 tokens. After testing, larger sequences (e.g. 512) performed worse than truncating those same sequences to 256, so the model authors restricted the maximum sequence length to 256. This is implemented in sentence_bert_config.json, rather than in transformers. As a result, when you use pure transformers, you'll get the reduced performance from the sequence length of 512.
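A minimal sketch of applying the same limit in pure transformers, assuming a local copy of the model repository in a directory called model_dir (a hypothetical name) and assuming sentence_bert_config.json stores the limit under a "max_seq_length" key. The helper names read_max_seq_length and tokenize_with_limit are made up for illustration; the tokenizer call itself only uses the standard truncation and max_length arguments.

```python
import json
from pathlib import Path


def read_max_seq_length(model_dir, fallback=512):
    """Return max_seq_length from sentence_bert_config.json, or fallback if absent."""
    config_path = Path(model_dir) / "sentence_bert_config.json"
    if not config_path.is_file():
        return fallback
    with config_path.open(encoding="utf-8") as f:
        # Assumption: the limit is stored under the "max_seq_length" key.
        return json.load(f).get("max_seq_length", fallback)


def tokenize_with_limit(tokenizer, texts, max_seq_length):
    """Tokenize while truncating at an explicit limit instead of the tokenizer default."""
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_seq_length,
        return_tensors="pt",
    )


# Hypothetical usage (needs transformers, torch, and the model files on disk):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_dir)
# batch = tokenize_with_limit(tokenizer, ["some long text"], read_max_seq_length(model_dir))
```

Passing max_length explicitly on every call keeps the limit visible in the calling code; setting tokenizer.model_max_length = 256 once after loading is another option when every call should share the same limit.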
My confusion there is that the output of applying the mean pooling to the last_hidden_state output does not equal the pooler_output produced by the model, so I don't understand their different sources/uses
Most likely, the Sentence Transformers model truncated to 256, whereas your manual mean used the sequence length of 512. Alternatively, you might need to perform mean pooling only over the tokens where attention_mask is 1, rather than over all tokens (including padding).
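A minimal sketch of masked mean pooling, written with NumPy on a toy batch so it runs without downloading the model; the same arithmetic applies to the last_hidden_state and attention_mask tensors from transformers. The function name masked_mean_pool and the toy values are made up for illustration. Separately, for BERT-style models in transformers, pooler_output is computed from the first token's hidden state passed through a dense layer and tanh, so it is not expected to equal any mean over tokens, masked or not.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real tokens only, ignoring padding.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    token_embeddings = np.asarray(token_embeddings, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)[..., None]  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # guard against division by zero
    return summed / counts


# Toy batch: one sequence with two real tokens followed by one padding position.
hidden = np.array([[[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

print(masked_mean_pool(hidden, mask))  # mean over the two real tokens
print(hidden.mean(axis=1))             # naive mean, skewed by the padding position
```

If a manual result still differs from the Sentence Transformers output after masking and truncating to the same length, it may be worth checking whether the reference pipeline also L2-normalizes the pooled vector.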
- Tom Aarsen
@tomaarsen For your first point, is there no way to override this default specifically for this model? It seems odd to me to use a value known to cause worse performance.
To your second point, when I was computing the last_hidden_state & pooler_output manually, I was using 256 as the max_length with the tokenizer, so I'm still a little confused as to why they would be different.
Appreciate any insight you might have.