How can I use vLLM to serve this GTE model?

#25
by qiujiaji - opened

I tried to use vLLM to serve this model, but it fails because the 'NewModel' architecture is not supported. I'm trying vLLM because serving GTE with LitServe is a bit slow: it reaches about 1300 QPS on 8 H20 GPUs, which is only slightly faster than Qwen-0.5B. I'd expect it to be much faster, since GTE only computes a single embedding per request while Qwen-0.5B generates multiple tokens in my case.
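For reference, recent vLLM releases can run encoder-style models in an embedding (pooling) mode rather than generation mode. Below is a minimal sketch of the offline Python API, assuming a vLLM version that recognizes the GTE architecture; the model id `Alibaba-NLP/gte-large-en-v1.5` is a placeholder, so substitute the actual checkpoint you are serving:

```python
# Minimal sketch: serving a GTE-style embedding model with vLLM's offline API.
# Assumes a recent vLLM release where the "embed" task is supported and the
# GTE architecture loads with trust_remote_code.
from vllm import LLM

llm = LLM(
    model="Alibaba-NLP/gte-large-en-v1.5",  # placeholder model id
    task="embed",              # pooling/embedding mode instead of generation
    trust_remote_code=True,    # GTE repos ship custom ("NewModel") code
)

outputs = llm.embed(["What is the capital of France?"])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```

The equivalent online deployment would be `vllm serve <model> --task embed --trust-remote-code`, which exposes an OpenAI-compatible `/v1/embeddings` endpoint. If your installed vLLM still rejects the 'NewModel' architecture, upgrading may help, since support for the GTE architecture was added in later releases.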

