Does this model only work on GPU?

#16 opened by xPurity

Hi,
I am fairly new to experimenting with embedding models in my RAG app. Until now I never had problems swapping out the model and re-indexing my documents, but now I am restricted to a CPU-only device. I can't get this model to work on it, so I just wanted to ask whether this is possible at all.
I'm using the sentence-transformers Python package and try to load the model without calling .cuda(), but I still run into an error because installing the flash-attn package fails.
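
In case a sketch helps: flash-attn only builds against CUDA, so on a CPU-only machine the usual workaround is to request a different attention implementation instead of installing it. A minimal sketch, assuming a recent sentence-transformers release that supports model_kwargs; the repo id is a placeholder, not the actual model:

```python
import torch
from sentence_transformers import SentenceTransformer

# "org/embedding-model" is a placeholder; substitute the real model id.
model = SentenceTransformer(
    "org/embedding-model",
    device="cpu",
    trust_remote_code=True,  # only needed if the model ships custom modeling code
    model_kwargs={
        "attn_implementation": "eager",  # avoid the flash-attn dependency
        "torch_dtype": torch.float32,    # fp16 is rarely faster on CPU
    },
)

embeddings = model.encode(["A quick CPU-only smoke test."])
print(embeddings.shape)
```

Whether this works depends on the model's code actually honoring attn_implementation; if it hard-requires flash-attn, the ONNX route below is the fallback.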

Hi,
I have tried the ONNX and quantized versions, which use the onnxruntime package and work on a CPU-only device. However, it is very slow because the tokenizer pads every batch to a length of 512 even when the inputs are shorter, so you'll have to change the padding strategy in the tokenizer config:
"strategy": "BatchLongest"
