Add AutoTokenizer & Sentence Transformers support

#1
by tomaarsen HF staff - opened

Hello!

Pull Request overview

  • Add AutoTokenizer support
  • Add Sentence Transformers support
  • Update some README metadata

Details

AutoTokenizer support

I saved the bert-base-uncased tokenizer into this repository (but with model_max_length set to 8192), so you can now load it with:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1")

Add Sentence Transformers support

The model needed to accept a return_dict argument, but its value can be ignored, as ST only uses return_dict=False. I also added the files that ST requires.

To experiment, feel free to run this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True, revision="refs/pr/1")
sentences = ['What is TSNE?', 'Who is Laurens van der Maaten?']
embeddings = model.encode(sentences)
print(embeddings)

This loads the model from this PR's branch. You'll see that the embeddings match the mean pooled & normalized embeddings from the Transformers-based snippet.
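For anyone unsure what "mean pooled & normalized" means here, below is a minimal NumPy sketch of the general technique: average the token embeddings over the non-padding positions, then scale each result to unit L2 length. It is not the code from the README snippet; the function name mean_pool_and_normalize and the toy inputs are made up for illustration.

```python
import numpy as np


def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Hypothetical helper, for illustration only.

    token_embeddings: float array of shape (batch, seq_len, dim)
    attention_mask:   0/1 array of shape (batch, seq_len), 0 marks padding
    """
    # Broadcast the mask over the embedding dimension so padding tokens contribute 0.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Divide by the number of real tokens; the clip guards against all-padding rows.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    pooled = summed / counts
    # L2-normalize each sentence embedding.
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)


# Toy input: one "sentence" with two real tokens and one padding token.
token_embeddings = np.array([[[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]])
attention_mask = np.array([[1, 1, 0]])

embeddings = mean_pool_and_normalize(token_embeddings, attention_mask)
print(embeddings)
```

In the toy input the padding token is ignored, so the mean is taken over the first two tokens only and the result has unit length.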

Metadata

The metadata tells Hugging Face that the model can be loaded with ST. Among other things, this adds a "Use with Sentence Transformers" button, which might boost the shareability of the model 💪

I also updated the README slightly. Feel free to make any suggestions or changes - it's your model after all :)

Note: The scarily large PR diff (60k lines) is because of the vocab.txt from the tokenizer.

  • Tom Aarsen
tomaarsen changed pull request status to open
zpn changed pull request status to merged
Nomic AI org

thank you!
