How to integrate this model with Sentence Transformers?

#31

by Nelson365487 - opened Feb 27

Feb 27

According to the model card, this model uses last-token pooling. I want to use it through Sentence Transformers, so I load it with "sentence_transformers.models.Transformer" and then add a pooling layer initialized with "sentence_transformers.models.Pooling(..., pooling_mode_mean_tokens=False, pooling_mode_lasttoken=True)".
Finally, I create the model with "model = SentenceTransformer(modules=[word_embedding_model, pooling_model])".
However, the embeddings from this custom model are very different from the ones I get by following the code in the model card.
Did I misunderstand something while integrating this model with Sentence Transformers? For example, is the pooling layer implemented differently, which would lead to different embeddings?
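
For readers following along, the setup described above can be sketched roughly as follows. This is a minimal sketch and not the poster's actual code: the names `build_last_token_model` and `LAST_TOKEN_POOLING` are hypothetical, the model name is the one mentioned later in this thread, and the `pooling_mode_lasttoken` flag is taken from the post, so it assumes a Sentence Transformers version whose `Pooling` module accepts it.

```python
# Pooling flags from the post: turn mean pooling off, last-token pooling on.
LAST_TOKEN_POOLING = {
    "pooling_mode_mean_tokens": False,
    "pooling_mode_lasttoken": True,
}


def build_last_token_model(model_name="intfloat/e5-mistral-7b-instruct"):
    """Assemble Transformer + last-token Pooling into a SentenceTransformer."""
    # Imported lazily so that defining this function does not need the library.
    from sentence_transformers import SentenceTransformer, models

    word_embedding_model = models.Transformer(model_name)
    pooling_model = models.Pooling(
        word_embedding_model.get_word_embedding_dimension(),
        **LAST_TOKEN_POOLING,
    )
    return SentenceTransformer(modules=[word_embedding_model, pooling_model])
```

Calling `build_last_token_model()` downloads and loads the full model, so it needs enough memory for a 7B-parameter checkpoint. Note that this sketch does nothing about the EOS token.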

intfloat

Owner Feb 28

Can you provide a minimal code snippet that can reproduce your results?

One issue with integrating with SentenceTransformers is that the tokenizer has to add an EOS token to the end of each input. I believe SentenceTransformers does not handle this automatically.
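
To illustrate why the EOS token matters for last-token pooling, here is a toy sketch in plain Python. It is not the model card's implementation: the function `last_token_pool` below is a simplified stand-in that assumes right padding, and the numbers are made up for illustration.

```python
def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token (right padding assumed)."""
    last_index = sum(attention_mask) - 1
    return hidden_states[last_index]


# Toy per-token "hidden states": one small vector per token (made-up values).
states_without_eos = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# With an EOS token appended, the sequence has one extra position.
states_with_eos = states_without_eos + [[0.9, 0.8]]

pooled_without = last_token_pool(states_without_eos, [1, 1, 1])
pooled_with = last_token_pool(states_with_eos, [1, 1, 1, 1])
```

The pooled vector is read from a different position in each case, so if the tokenizer leaves out the EOS token, the embedding comes from a different token than the one the model card's code would use.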

Jonathan0528

Feb 28

sentence-transformers should have added support for the EOS token by now.
See https://huggingface.co/Salesforce/SFR-Embedding-Mistral/discussions/1.

I have tried the merged configs in Salesforce/SFR-Embedding-Mistral and they should work.
Hope to see it in intfloat/e5-mistral-7b-instruct!

Nelson365487

Feb 29

Thanks for your replies @intfloat @Jonathan0528. I checked add_eos_token on the tokenizer after loading the model with SentenceTransformers, and just as @intfloat said, the tokenizer does not add the EOS token automatically. The discrepancy with what @Jonathan0528 said is probably due to my SentenceTransformers version: I have 2.2.2 installed, which I think is quite old. After setting add_eos_token=True and rerunning the example, everything works. Thanks again @intfloat @Jonathan0528.
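
A quick way to run this kind of check is sketched below. The helper name `appends_eos` is hypothetical. It only assumes a Hugging Face style tokenizer that can be called on a string, returns `input_ids`, and exposes `eos_token_id`.

```python
def appends_eos(tokenizer, text="hello world"):
    """Return True if the tokenizer ends the encoded text with its EOS token id."""
    input_ids = tokenizer(text)["input_ids"]
    return len(input_ids) > 0 and input_ids[-1] == tokenizer.eos_token_id
```

With a model built from `sentence_transformers.models.Transformer`, the tokenizer to pass in would be the `tokenizer` attribute of that module.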

Nelson365487 changed discussion status to closed Feb 29

woofadu

Apr 3

edited Apr 3

@Nelson365487 Where did you set add_eos_token=True for this, if @Jonathan0528's solution did not work?

Nelson365487

Apr 8

@woofadu, maybe you can try passing the argument via tokenizer_args when initializing sentence_transformers.models.Transformer, or try modifying the tokenizer after initialization.
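
The two options might look roughly like this. It is a sketch under assumptions rather than tested integration code: the function names `transformer_with_eos` and `enable_eos` are hypothetical, and both options assume the underlying tokenizer honors an `add_eos_token` setting, as the Llama-style tokenizers in `transformers` do.

```python
def transformer_with_eos(model_name="intfloat/e5-mistral-7b-instruct"):
    """Option 1: pass the flag through tokenizer_args at construction time."""
    # Imported lazily so that defining this function does not need the library.
    from sentence_transformers import models

    return models.Transformer(model_name, tokenizer_args={"add_eos_token": True})


def enable_eos(word_embedding_model):
    """Option 2: flip the flag on the tokenizer of an already-initialized module."""
    word_embedding_model.tokenizer.add_eos_token = True
    return word_embedding_model
```

Either way, it is worth encoding a short string afterwards and confirming that the last input id equals the tokenizer's `eos_token_id`.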

woofadu

May 3

@Nelson365487 modifying after initialization worked. Thank you
