Add AutoTokenizer & Sentence Transformers support
Hello!
## Pull Request overview
- Add AutoTokenizer support.
- Add Sentence Transformers support
- Update some README metadata
## Details

### AutoTokenizer support
I saved the `bert-base-uncased` tokenizer into this repository (but with `model_max_length` set to 8192), so you can now use:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1")
```
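As a quick sanity check, you could print the configured maximum length (a minimal sketch; the expected value simply reflects the description above):

```python
from transformers import AutoTokenizer

# The tokenizer saved in this repository should report the extended
# context length rather than bert-base-uncased's default of 512.
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1")
print(tokenizer.model_max_length)  # expected: 8192
```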
### Sentence Transformers support
`return_dict` was required, but it can be ignored, as Sentence Transformers only uses `return_dict=False`. I also added the required files.
To experiment, feel free to run this:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True, revision="pr/1")

sentences = ['What is TSNE?', 'Who is Laurens van der Maaten?']
embeddings = model.encode(sentences)
print(embeddings)
```
This loads the model from this PR branch. You'll see that the embeddings match the mean-pooled & normalized embeddings from the Transformers-based snippet.
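For reference, here is a minimal sketch of that Transformers-based comparison (mean pooling over non-padding tokens, then L2 normalization). It assumes the usual recipe from the model card, that the first element of the model output is the last hidden state, and that the tokenizer is loaded from this PR branch; the exact snippet may differ slightly:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

sentences = ['What is TSNE?', 'Who is Laurens van der Maaten?']

# The tokenizer is added in this PR, hence revision="pr/1"
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1", revision="pr/1")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
model.eval()

encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**encoded)[0]  # assumed: last hidden states

# Mean pooling over non-padding tokens, followed by L2 normalization
mask = encoded["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings)
```

The result should line up with the output of `model.encode(sentences)` from the Sentence Transformers snippet above.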
### Metadata
The metadata is used to tell Hugging Face that the model can be loaded with Sentence Transformers; it also creates a "Use with Sentence Transformers" button, for example, which might boost the shareability of the model 💪
I also updated the README slightly. Feel free to make any suggestions or changes - it's your model after all :)
Note: The scarily large PR diff (60k lines) is because of the vocab.txt from the tokenizer.
- Tom Aarsen
thank you!