How could I use vLLM to serve this GTE model?
#25 · opened by qiujiaji
I tried to use vLLM to serve this model, but it fails because the 'NewModel' architecture is not supported. I'm looking at vLLM because my current deployment with LitServe is a bit slow: it reaches about 1300 QPS on 8× H20 GPUs, which is only slightly faster than Qwen-0.5B. I would expect GTE to be much faster, since it only computes a single embedding per request while Qwen-0.5B generates multiple tokens in my case.
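For reference, here is a minimal sketch of serving embeddings through vLLM's offline API, assuming a vLLM release whose embedding (pooling) runner supports this checkpoint. The model name is illustrative, and whether `trust_remote_code=True` is enough to get past the 'NewModel' error depends on your vLLM version, so treat this as a starting point rather than a confirmed fix:

```python
# Minimal sketch: offline embedding with vLLM. Assumes a recent vLLM
# release that can load this GTE checkpoint under the embedding task;
# versions that still reject the 'NewModel' architecture will raise
# the same unsupported-model error here.
from vllm import LLM

llm = LLM(
    model="Alibaba-NLP/gte-large-en-v1.5",  # illustrative checkpoint name
    task="embed",             # pooling/embedding path instead of generation
    trust_remote_code=True,   # GTE configs ship a custom model class
)

# embed() returns one EmbeddingRequestOutput per prompt.
outputs = llm.embed(["What is the capital of France?"])
vector = outputs[0].outputs.embedding
print(len(vector), vector[:8])  # embedding dimension and first few values
```

The OpenAI-compatible server path is analogous (`vllm serve <model> --task embed --trust-remote-code`, then POST to `/v1/embeddings`), but again, whether this particular GTE variant loads depends on the vLLM version.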