Add infinity as example deployment
#22 opened by michaelfeil
No description provided.
docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" \
michaelf34/infinity:0.0.68 \
v2 --model-id WhereIsAI/UAE-Large-V1 --revision "369c368f70f16a613f19f5598d4f12d9f44235d4" --dtype float16 --batch-size 32 --device cuda --engine torch --port 7997
INFO 2024-11-12 23:37:34,638 infinity_emb INFO: Creating 1 engines: engines=['WhereIsAI/UAE-Large-V1'] (infinity_server.py:89)
INFO 2024-11-12 23:37:34,649 infinity_emb INFO: model=`WhereIsAI/UAE-Large-V1` selected, using engine=`torch` and device=`cuda` (select_model.py:64)
INFO 2024-11-12 23:37:34,653 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: WhereIsAI/UAE-Large-V1 (SentenceTransformer.py:216)
INFO 2024-11-12 23:37:36,676 infinity_emb INFO: Adding optimizations via Huggingface optimum. (acceleration.py:56)
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
/app/.venv/lib/python3.10/site-packages/optimum/bettertransformer/models/encoder_models.py:301: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mask)
INFO 2024-11-12 23:37:36,995 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=1 (select_model.py:97)
3.17 ms tokenization
7.74 ms inference
0.08 ms post-processing
10.99 ms total
embeddings/sec: 2910.49
INFO 2024-11-12 23:37:37,185 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=512 (select_model.py:103)
16.29 ms tokenization
65.72 ms inference
0.14 ms post-processing
82.15 ms total
embeddings/sec: 389.52
INFO 2024-11-12 23:37:37,187 infinity_emb INFO: model warmed up, between 389.52-2910.49 embeddings/sec at batch_size=32 (select_model.py:104)
INFO 2024-11-12 23:37:37,190 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: WhereIsAI/UAE-Large-V1 (SentenceTransformer.py:216)
INFO 2024-11-12 23:37:38,179 infinity_emb INFO: Adding optimizations via Huggingface optimum. (acceleration.py:56)
INFO 2024-11-12 23:37:47,633 infinity_emb INFO: (infinity_server.py:104)
♾️ Infinity - Embedding Inference Server
MIT License; Copyright (c) 2023-now Michael Feil
Version 0.0.68
Open the Docs via Swagger UI: http://0.0.0.0:7997/docs
Access all deployed models via 'GET': curl http://0.0.0.0:7997/models
Visit the docs for more information: https://michaelfeil.github.io/infinity
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
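Once startup completes, the running server can be exercised from the host. A minimal sketch, assuming the OpenAI-compatible routes the startup banner points at; the exact request-body shape below is an assumption modeled on the OpenAI embeddings schema, not taken from this thread:

# List the deployed models (GET route shown in the banner above)
curl http://0.0.0.0:7997/models

# Request embeddings; assumes an OpenAI-compatible POST /embeddings
# route accepting {"model": ..., "input": [...]}
curl http://0.0.0.0:7997/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "WhereIsAI/UAE-Large-V1", "input": ["hello world"]}'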
@SeanLee97 Can you review?
@michaelfeil Thank you for the PR! I've been following infinity. It's super cool!
For the PR, does it work if I change michaelf34/infinity:0.0.68 to michaelf34/infinity:latest?
Yes, but it might be maximally stable if we add --revision 584fb280384b508a5ca77547a6f0d98d64809e32 and pin to a specific version. The latest tag is currently just the same image as 0.0.70, and it should be compatible with all releases since circa 0.0.35. A pinned command is sketched below.
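For reference, a pinned deployment along these lines might look like the following; the 0.0.70 image tag and the revision hash are taken from this comment, but treat the exact combination as an assumption rather than a tested command:

# Same deployment as above, with the image tag and model revision pinned
docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" \
michaelf34/infinity:0.0.70 \
v2 --model-id WhereIsAI/UAE-Large-V1 --revision "584fb280384b508a5ca77547a6f0d98d64809e32" --dtype float16 --batch-size 32 --device cuda --engine torch --port 7997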
Great work! I'll merge it. Thank you!
SeanLee97 changed pull request status to merged
@SeanLee97 Just saw you are also a baker at mixedbread! Awesome work there!
It works fast, except for DeBERTa-based models, and it's inference-only! Your team is aware of it :) https://github.com/mixedbread-ai/batched?tab=readme-ov-file#attribution