Loading the non-quantized model in the browser

#1
by nbolton04

My workflow is to embed the documents on GPU. I tried using the quantized model in Python on GPU, but saw a significant drop in performance, and CPU usage climbs to over 1000% even with a batch size of 1. My next step was to try loading the non-quantized model via transformers.js, since I would rather have slow client-side browser inference than have the initial processing take that much longer. However, when I do that I get an error. Do you have suggestions or examples of how to do the latter?
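
For reference, here is roughly what I am attempting. This is a minimal sketch assuming Transformers.js v3 (`@huggingface/transformers`); the model id is a placeholder for the actual embedding model:

```js
import { pipeline } from '@huggingface/transformers';

// dtype: 'fp32' requests the full-precision (non-quantized) ONNX weights
// instead of the default quantized variant.
const extractor = await pipeline(
  'feature-extraction',
  'your-org/your-embedding-model', // placeholder: the actual model id
  { dtype: 'fp32' }
);

// Embed a document with mean pooling and L2 normalization.
const embeddings = await extractor(['some document text'], {
  pooling: 'mean',
  normalize: true,
});
```

If this were the older v2 package (`@xenova/transformers`), I understand the equivalent option would be `{ quantized: false }` instead of `dtype`, but I may be missing something about how the non-quantized weights are resolved.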
