nomic-ai/nomic-embed-text-v1 · Add AutoTokenizer & Sentence Transformers support

tomaarsen

Nomic AI org Feb 1, 2024

edited Feb 1, 2024

Hello!

Pull Request overview

Add AutoTokenizer support.
Add Sentence Transformers support.
Update some README metadata.

Details

AutoTokenizer support

I saved the bert-base-uncased tokenizer into this repository (but with model_max_length set to 8192), so you can now load it directly:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1")

Sentence Transformers support

The model's forward method needed to accept a return_dict argument, but its value can be ignored because ST only ever passes return_dict=False. I also added the configuration files that ST requires.
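A minimal sketch of the return_dict point, using a toy class rather than the actual modeling code in this repository. The class name `ToyEncoder` and its fake hidden states are hypothetical; the only thing it illustrates is a forward method that accepts return_dict and ignores its value.

```python
class ToyEncoder:
    """Hypothetical stand-in for a custom model class; not the real modeling code."""

    def forward(self, input_ids, attention_mask=None, return_dict=None):
        # return_dict is accepted so that callers passing it do not hit a TypeError,
        # but its value is ignored: this toy always returns a plain tuple.
        hidden_states = [[float(token_id)] for token_id in input_ids]
        return (hidden_states,)


encoder = ToyEncoder()
# A caller that always passes return_dict=False, as described above for ST.
outputs = encoder.forward([101, 2054, 102], return_dict=False)
token_embeddings = outputs[0]  # first tuple element holds the per-token outputs
```

Because the caller indexes into a tuple, the model never needs to build a dict-style output object for this use case.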

To experiment, feel free to run this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True, revision="pr/1")
sentences = ['What is TSNE?', 'Who is Laurens van der Maaten?']
embeddings = model.encode(sentences)
print(embeddings)

This loads the model from this PR's branch. You'll see that the embeddings match the mean-pooled and normalized embeddings from the Transformers-based snippet in the README.
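For reference, "mean pooled & normalized" can be sketched as follows. This is an illustration in NumPy with made-up token embeddings, not the code from the README; the function name `mean_pool_and_normalize` and the toy inputs are assumptions for the example.

```python
import numpy as np


def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions, then L2-normalize."""
    token_embeddings = np.asarray(token_embeddings, dtype=np.float64)  # (batch, seq, dim)
    mask = np.asarray(attention_mask, dtype=np.float64)[..., None]     # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                      # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                      # avoid division by zero
    pooled = summed / counts
    norms = np.clip(np.linalg.norm(pooled, axis=1, keepdims=True), 1e-12, None)
    return pooled / norms


# Toy batch: one sentence with three token slots, the last of which is padding.
toy_embeddings = [[[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]]
toy_mask = [[1, 1, 0]]
pooled = mean_pool_and_normalize(toy_embeddings, toy_mask)
print(pooled)
```

The padded position is excluded by the mask, so only the first two token vectors contribute to the average before normalization.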

Metadata

The metadata tells Hugging Face that the model can be loaded with ST. Among other things, this adds a "Use with Sentence Transformers" button, which might boost the shareability of the model 💪

I also updated the README slightly. Feel free to make any suggestions or changes - it's your model after all :)

Note: The scarily large PR diff (60k lines) is because of the vocab.txt from the tokenizer.

Tom Aarsen

Merge branch 'main' into integration/sentence_transformers (790cf31f)

Remove accidental .vscode push (d46c50ab)

tomaarsen changed pull request status to open Feb 1, 2024

zpn changed pull request status to merged Feb 1, 2024

zpn

Nomic AI org Feb 1, 2024

thank you!