How can I use vLLM to serve this GTE model?

#25
by qiujiaji - opened

I tried to use vLLM to serve this model, but it fails because the 'NewModel' architecture is not supported. I'm trying vLLM because serving GTE with LitServe is a bit slow: it reaches about 1300 QPS on 8 H20 GPUs, which is only slightly faster than Qwen-0.5B. I'd expect it to be much faster, since GTE only computes a single embedding per request while Qwen-0.5B generates multiple tokens in my case.
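For reference, recent vLLM releases can run encoder-style models in an embedding (pooling) mode rather than generation mode. Below is a minimal sketch of the offline Python API, assuming a vLLM version that recognizes the GTE architecture; the model id `Alibaba-NLP/gte-large-en-v1.5` is a placeholder, so substitute the actual checkpoint you are serving:

```python
# Minimal sketch: serving a GTE-style embedding model with vLLM's offline API.
# Assumes a recent vLLM release where the "embed" task is supported and the
# GTE architecture loads with trust_remote_code.
from vllm import LLM

llm = LLM(
    model="Alibaba-NLP/gte-large-en-v1.5",  # placeholder model id
    task="embed",              # pooling/embedding mode instead of generation
    trust_remote_code=True,    # GTE repos ship custom ("NewModel") code
)

outputs = llm.embed(["What is the capital of France?"])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```

The equivalent online deployment would be `vllm serve <model> --task embed --trust-remote-code`, which exposes an OpenAI-compatible `/v1/embeddings` endpoint. If your installed vLLM still rejects the 'NewModel' architecture, upgrading may help, since support for the GTE architecture was added in later releases.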

