Add infinity as example deployment

#22
No description provided.
docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" \
michaelf34/infinity:0.0.68 \
v2 --model-id WhereIsAI/UAE-Large-V1 --revision "369c368f70f16a613f19f5598d4f12d9f44235d4" --dtype float16 --batch-size 32 --device cuda --engine torch --port 7997
INFO     2024-11-12 23:37:34,638 infinity_emb INFO:        infinity_server.py:89
         Creating 1 engines:                                                     
         engines=['WhereIsAI/UAE-Large-V1']                                     
INFO     2024-11-12 23:37:34,649 infinity_emb INFO:           select_model.py:64
         model=`WhereIsAI/UAE-Large-V1` selected, using                         
         engine=`torch` and device=`cuda`                                       
INFO     2024-11-12 23:37:34,653                      SentenceTransformer.py:216
         sentence_transformers.SentenceTransformer                              
         INFO: Load pretrained SentenceTransformer:                             
         WhereIsAI/UAE-Large-V1                                                 
INFO     2024-11-12 23:37:36,676 infinity_emb INFO: Adding    acceleration.py:56
         optimizations via Huggingface optimum.                                 
The class `optimum.bettertransformers.transformation.BetterTransformer` is deprecated and will be removed in a future release.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
/app/.venv/lib/python3.10/site-packages/optimum/bettertransformer/models/encoder_models.py:301: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
  hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mask)
INFO     2024-11-12 23:37:36,995 infinity_emb INFO: Getting   select_model.py:97
         timings for batch_size=32 and avg tokens per                           
         sentence=1                                                             
                 3.17     ms tokenization                                       
                 7.74     ms inference                                          
                 0.08     ms post-processing                                    
                 10.99    ms total                                              
         embeddings/sec: 2910.49                                                
INFO     2024-11-12 23:37:37,185 infinity_emb INFO: Getting  select_model.py:103
         timings for batch_size=32 and avg tokens per                           
         sentence=512                                                           
                 16.29    ms tokenization                                       
                 65.72    ms inference                                          
                 0.14     ms post-processing                                    
                 82.15    ms total                                              
         embeddings/sec: 389.52                                                 
INFO     2024-11-12 23:37:37,187 infinity_emb INFO: model    select_model.py:104
         warmed up, between 389.52-2910.49 embeddings/sec at                    
         batch_size=32                                                          
INFO     2024-11-12 23:37:37,190                      SentenceTransformer.py:216
         sentence_transformers.SentenceTransformer                              
         INFO: Load pretrained SentenceTransformer:                             
         WhereIsAI/UAE-Large-V1                                                 
INFO     2024-11-12 23:37:38,179 infinity_emb INFO: Adding    acceleration.py:56
         optimizations via Huggingface optimum.        
INFO     2024-11-12 23:37:47,633 infinity_emb INFO:       infinity_server.py:104
                                                                                
         ♾️  Infinity - Embedding Inference Server                               
         MIT License; Copyright (c) 2023-now Michael Feil                       
         Version 0.0.68                                                         
                                                                                
         Open the Docs via Swagger UI:                                          
         http://0.0.0.0:7997/docs                                               
                                                                                
         Access all deployed models via 'GET':                                  
         curl http://0.0.0.0:7997/models                                        
                                                                                
         Visit the docs for more information:                                   
         https://michaelfeil.github.io/infinity                                 
                                                                                
                                                                                
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:7997 (Press CTRL+C to quit)
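Once the container is up, the endpoints from the banner can be exercised with plain curl. A minimal sketch, assuming the OpenAI-compatible /embeddings route exposed by Infinity and the port 7997 from the command above:

# list the deployed models (as printed in the banner)
curl http://0.0.0.0:7997/models

# request embeddings; the response JSON contains one vector per input string
curl http://0.0.0.0:7997/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "WhereIsAI/UAE-Large-V1", "input": ["hello world", "embed this sentence"]}'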

@SeanLee97 Can you review?

WhereIsAI org

@michaelfeil Thank you for the PR! I've been following infinity. It is super cool!

For the PR, does it work if I change michaelf34/infinity:0.0.68 to michaelf34/infinity:latest?

Yes, but it would be most stable if we add --revision 584fb280384b508a5ca77547a6f0d98d64809e32 and pin to a specific version.

The latest tag is currently the same image as 0.0.70. It should be compatible with all releases since roughly 0.0.35.
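For illustration, a rough sketch of the pinned variant being discussed (assuming the :latest image together with the model revision mentioned above; adjust the tag and hash as needed):

docker run --gpus all -v $PWD/data:/app/.cache -p "7997":"7997" \
michaelf34/infinity:latest \
v2 --model-id WhereIsAI/UAE-Large-V1 --revision "584fb280384b508a5ca77547a6f0d98d64809e32" --dtype float16 --batch-size 32 --device cuda --engine torch --port 7997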

WhereIsAI org

Great work! I'll merge it. Thank you!

SeanLee97 changed pull request status to merged

@SeanLee97 Just saw you are also a baker at mixedbread! Awesome work there!

It works fast, except for DeBERTa-based models, and it is inference only! Your team is already aware of it :) https://github.com/mixedbread-ai/batched?tab=readme-ov-file#attribution
