Supported models and hardware
We are continually expanding our support for other model types and plan to include them in future updates.
Supported embeddings models
Text Embeddings Inference currently supports Nomic, BERT, CamemBERT, and XLM-RoBERTa models with absolute positions; JinaBERT models with ALiBi positions; and Mistral, Alibaba GTE, and Qwen2 models with RoPE positions.
Below are some examples of the currently supported models:
| MTEB Rank | Model Size | Model Type | Model ID |
|---|---|---|---|
| 1 | 7B (Very Slow) | Mistral | Salesforce/SFR-Embedding-2_R |
| 15 | 0.4B | Alibaba GTE | Alibaba-NLP/gte-large-en-v1.5 |
| 20 | 0.3B | BERT | WhereIsAI/UAE-Large-V1 |
| 24 | 0.5B | XLM-RoBERTa | intfloat/multilingual-e5-large-instruct |
| N/A | 0.1B | NomicBERT | nomic-ai/nomic-embed-text-v1 |
| N/A | 0.1B | NomicBERT | nomic-ai/nomic-embed-text-v1.5 |
| N/A | 0.1B | JinaBERT | jinaai/jina-embeddings-v2-base-en |
| N/A | 0.1B | JinaBERT | jinaai/jina-embeddings-v2-base-code |
To explore the list of best-performing text embedding models, visit the Massive Text Embedding Benchmark (MTEB) Leaderboard.
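For example, any of the models above can be served by passing its Hub ID to the container via the `--model-id` flag. Below is a minimal sketch using the CPU image; the port mapping and volume path are illustrative, and the volume simply caches downloaded weights between restarts:

```shell
# Serve an embeddings model from the table above on CPU.
model=nomic-ai/nomic-embed-text-v1.5
volume=$PWD/data  # cache model weights between container restarts

docker run -p 8080:80 -v $volume:/data --pull always \
    ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
    --model-id $model

# Once the server is up, request embeddings over HTTP:
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```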
Supported re-rankers and sequence classification models
Text Embeddings Inference currently supports CamemBERT and XLM-RoBERTa Sequence Classification models with absolute positions.
Below are some examples of the currently supported models:
| Task | Model Type | Model ID | Revision |
|---|---|---|---|
| Re-Ranking | XLM-RoBERTa | BAAI/bge-reranker-large | refs/pr/4 |
| Re-Ranking | XLM-RoBERTa | BAAI/bge-reranker-base | refs/pr/5 |
| Sentiment Analysis | RoBERTa | SamLowe/roberta-base-go_emotions | |
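Re-rankers are launched the same way as embeddings models, passing the `--revision` flag where the table lists one; the `/rerank` route then scores a query against a list of candidate texts. A minimal sketch, again using the CPU image and illustrative paths:

```shell
# Serve a re-ranker; --revision selects the pull-request branch listed above.
model=BAAI/bge-reranker-large
revision=refs/pr/4

docker run -p 8080:80 -v $PWD/data:/data --pull always \
    ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
    --model-id $model --revision $revision

# Rank candidate texts against a query:
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json'
```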
Supported hardware
Text Embeddings Inference can be used on CPU, Turing (T4, RTX 2000 series, …), Ampere 80 (A100, A30), Ampere 86 (A10, A40, …), Ada Lovelace (RTX 4000 series, …), and Hopper (H100) architectures.
The library does not support CUDA compute capabilities < 7.5, which means V100, Titan V, GTX 1000 series, etc. are not supported. To leverage your GPUs, make sure to install the NVIDIA Container Toolkit, and use NVIDIA drivers with CUDA version 12.2 or higher.
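One way to sanity-check both requirements, assuming the NVIDIA Container Toolkit is already installed, is to run `nvidia-smi` inside a CUDA base image and inspect the reported driver CUDA version:

```shell
# Should print a table listing your GPUs and a CUDA version of 12.2 or higher.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```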
Find the appropriate Docker image for your hardware in the following table:
| Architecture | Image |
|---|---|
| CPU | ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 |
| Volta | NOT SUPPORTED |
| Turing (T4, RTX 2000 series, …) | ghcr.io/huggingface/text-embeddings-inference:turing-1.6 (experimental) |
| Ampere 80 (A100, A30) | ghcr.io/huggingface/text-embeddings-inference:1.6 |
| Ampere 86 (A10, A40, …) | ghcr.io/huggingface/text-embeddings-inference:86-1.6 |
| Ada Lovelace (RTX 4000 series, …) | ghcr.io/huggingface/text-embeddings-inference:89-1.6 |
| Hopper (H100) | ghcr.io/huggingface/text-embeddings-inference:hopper-1.6 (experimental) |
Warning: Flash Attention is turned off by default for the Turing image, as it suffers from precision issues on that architecture. You can turn Flash Attention v1 on with the `USE_FLASH_ATTENTION=True` environment variable.
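Launching the Turing image on a T4 with Flash Attention v1 explicitly enabled could look like the sketch below; the model is taken from the embeddings table above, and `--gpus all` requires the NVIDIA Container Toolkit mentioned earlier:

```shell
# Turing image: Flash Attention is off by default; opt in via the environment variable.
model=WhereIsAI/UAE-Large-V1

docker run --gpus all -p 8080:80 -v $PWD/data:/data --pull always \
    -e USE_FLASH_ATTENTION=True \
    ghcr.io/huggingface/text-embeddings-inference:turing-1.6 \
    --model-id $model
```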