Slow inference performance when using nomic-embed-text-v1.5

#34
by umesh-c - opened

Hello there,
After considering multiple aspects of this model, we decided to give it a shot over bge-large-en. The first observation is that it runs pretty slowly, even on GPU. My code looks like this:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from models.semantic_search_gen import SemanticSearchGen

class NomicEmbedText(SemanticSearchGen):
    """Implementation to use vector embedding model nomic-embed-text-v1.5"""
    def __init__(self):
        """
        Constructor to initialize the model
        """
        super().__init__("nomic-embed-text-v1.5")
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_path)
        self.model = AutoModel.from_pretrained(self.model_path, trust_remote_code=True, safe_serialization=True)
        self.model.eval()
        self.model.max_seq_length = self.model_conf["token_limit"]
        self.token_limit = self.model.max_seq_length

    def mean_pooling(self, model_output, attention_mask):
        """
        Implementation to calculate embeddings after mean pooling
        :param model_output: output of the model forward pass (token embeddings at index 0)
        :param attention_mask: attention mask from the tokenizer, used to ignore padding tokens
        :return: Vector embeddings after mean pooling
        """
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    def gen_embeddings(self, text, prompt=False, text_type="search_document"):
        """
        Generate vector embeddings
        :param text: input text to generate the embeddings
        :param prompt: Not applicable here
        :param text_type: Default is "search_document", which generates embeddings for a search document. Pass "search_query" here when generating embeddings for a search query
        :return: Vector embeddings as a NumPy array of float values
        """

        if text_type == "search_document":
            instruction = "search_document: "
        else:
            instruction = "search_query: "

        encoded_input = self.tokenizer(instruction + text, padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            model_output = self.model(**encoded_input)
        embeddings = self.mean_pooling(model_output, encoded_input['attention_mask'])
        embeddings = F.normalize(embeddings, p=2, dim=1)
        # Convert embeddings from Tensor to NumPy and return an array of floats
        embeddings = embeddings.numpy()[0]
        return embeddings

    def get_model(self):
        return self.model

Could the slowness be due to the pooling calculation, or to not utilizing the GPU while calculating the embeddings?
Due to some conversion issue, I was not able to run it via sentence-transformers; it was giving me a torch-conversion-related stacktrace.

Thanks!

Nomic AI org
edited Aug 30

Hello!

I can think of two causes here:

  1. (Most likely) Nomic's tokenizer accepts much longer inputs than bge-large-en-v1.5: 8192 tokens instead of 512. This means the model may have to process far more tokens per text, and encoder models slow down roughly quadratically with sequence length, so this is a very likely cause. If you don't need the longer context, you can set tokenizer.model_max_length = 512 and test whether performance improves (see the sketch after this list).
  2. (Less likely) bge-large-en-v1.5 via SentenceTransformers automatically does batching (32 samples per inference by default, I believe). This is quite a bit faster than doing 32 inferences of 1 sample.
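
A minimal sketch of the first suggestion (plus the batching from point 2); the model id and texts below are just illustrative:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
model.eval()

# Cap the tokenizer at 512 tokens so long documents get truncated,
# keeping the sequence length (and attention cost) comparable to bge-large-en-v1.5.
tokenizer.model_max_length = 512

texts = ["search_document: first document", "search_document: second document"]
# Encode a batch of texts in one forward pass instead of one text at a time.
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)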

Also,

Due to some conversion issue, I was not able to run it via sentence-transformers as it was giving me some torch conversion related stacktrace.

This is a shame :/ Could you post the stacktrace when you run:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
sentences = ['search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten']
embeddings = model.encode(sentences)
print(embeddings)
  • Tom Aarsen

Hi Tom,
Thanks for pointing out the likely causes behind the slowness.
I am going to try tokenizer.model_max_length = 512 and will update you shortly.

Regarding the inability to use SentenceTransformer, here is the stacktrace I get whenever I use SentenceTransformer for nomic:

/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
A new version of the following files was downloaded from https://huggingface.co/nomic-ai/nomic-bert-2048:
- configuration_hf_nomic_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/nomic-ai/nomic-bert-2048:
- modeling_hf_nomic_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
<All keys matched successfully>
Fatal Python error: Aborted

Thread 0x00000002a72ff000 (most recent call first):
  File "/opt/homebrew/Cellar/python@3.11/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 331 in wait
  File "/opt/homebrew/Cellar/python@3.11/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 629 in wait
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/opt/homebrew/Cellar/python@3.11/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/opt/homebrew/Cellar/python@3.11/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1002 in _bootstrap

<Truncated some other threads to show the main culprit, which is below> 

Current thread 0x00000001f59e2500 (most recent call first):
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1160 in convert
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 805 in _apply
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780 in _apply
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780 in _apply
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780 in _apply
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 780 in _apply
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1174 in to
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 215 in __init__
  File "/Users/umesh/git-repos/genai_search/models/impl/nomic_embed_text_v1.py", line 18 in __init__
  File "/Users/umesh/git-repos/genai_search/models/model_factory.py", line 28 in __getattr__
  File "/Users/umesh/git-repos/genai_search/streaming_processor.py", line 315 in apply_cleanup_and_store
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/util.py", line 81 in wrapper
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 1745 in processPartition
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 828 in func
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 5405 in pipeline_func
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 5405 in pipeline_func
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/rdd.py", line 5405 in pipeline_func
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 820 in process
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 830 in main
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 74 in worker
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 193 in manager
  File "/Users/umesh/git-repos/genai_search/cxg/lib/python3.11/site-packages/pyspark/python/lib/pyspark.zip/pyspark/daemon.py", line 218 in <module>
  File "<frozen runpy>", line 88 in _run_code
  File "<frozen runpy>", line 198 in _run_module_as_main 
pip list | grep -ia transformer                                                                                                                  
sentence-transformers                   2.6.1
transformers                            4.37.0

Thanks!

Here is the torch version:

pip list | grep torch                                                                                                                                 
torch                                   2.4.0
torchvision                             0.19.0

Unfortunately setting tokenizer.model_max_length = 512 doesn't seem to give any performance boost in my case.

BTW, thanks for asking about the SentenceTransformer issue. I revisited it and was able to fix it by passing the device parameter. Strangely, the issue in the thread dump above is gone when using the lines below; earlier I was not passing the device param:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
self.model = SentenceTransformer(self.model_path, trust_remote_code=True, device=device)
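
The encode call that then uses this (encode batches internally, 32 samples per batch by default; the sentences here are just illustrative):

sentences = ["search_document: first document", "search_document: second document"]
embeddings = self.model.encode(sentences)  # runs on the device passed above, batched internally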

Also, I must say SentenceTransformer is way faster than plain Transformers; I still don't know why ;)

Thanks!

Nomic AI org

From the code above, it doesn't seem like you are putting either the model or the inputs on the GPU, which could explain the slowness of transformers. IIRC, sentence-transformers handles a lot of this for you, especially when passing device. To verify, you can check the output of nvidia-smi while the transformers code is running; it should show some GPU usage.
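
Roughly, the transformers path with the device handled would look something like this (a sketch, mirroring the pooling in your class; the model id and input text are illustrative):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
model.eval()
model.to(device)  # move the weights to the GPU once, at init time

encoded = tokenizer("search_document: some example document", padding=True, truncation=True, return_tensors="pt")
encoded = {k: v.to(device) for k, v in encoded.items()}  # inputs must live on the same device as the model
with torch.no_grad():
    output = model(**encoded)

# same mean pooling + normalization as in the class above
token_embeddings = output[0]
mask = encoded["attention_mask"].unsqueeze(-1).expand(token_embeddings.size()).float()
embeddings = torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)
embeddings = F.normalize(embeddings, p=2, dim=1)
embeddings = embeddings.cpu().numpy()[0]  # move back to CPU before converting to NumPy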
