Supported models and hardware

We are continually expanding our support for other model types and plan to include them in future updates.

Supported embedding models

Text Embeddings Inference currently supports Nomic, BERT, CamemBERT, and XLM-RoBERTa models with absolute positions; JinaBERT models with ALiBi positions; and Mistral, Alibaba GTE, and Qwen2 models with RoPE positions.

Below are some examples of the currently supported models:

| MTEB Rank | Model Size | Model Type | Model ID |
|-----------|------------|------------|----------|
| 1 | 7B (Very Slow) | Mistral | Salesforce/SFR-Embedding-2_R |
| 15 | 0.4B | Alibaba GTE | Alibaba-NLP/gte-large-en-v1.5 |
| 20 | 0.3B | Bert | WhereIsAI/UAE-Large-V1 |
| 24 | 0.5B | XLM-RoBERTa | intfloat/multilingual-e5-large-instruct |
| N/A | 0.1B | NomicBert | nomic-ai/nomic-embed-text-v1 |
| N/A | 0.1B | NomicBert | nomic-ai/nomic-embed-text-v1.5 |
| N/A | 0.1B | JinaBERT | jinaai/jina-embeddings-v2-base-en |
| N/A | 0.1B | JinaBERT | jinaai/jina-embeddings-v2-base-code |

To explore the list of best performing text embeddings models, visit the Massive Text Embedding Benchmark (MTEB) Leaderboard.
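
As a reference point, here is a minimal sketch of serving one of the models above with the CPU image and querying its `/embed` route. The host port, cache directory, and choice of model are illustrative:

```shell
# Serve an embedding model from the table above with the CPU image
# (the container listens on port 80; host port 8080 is illustrative).
docker run -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
    --model-id nomic-ai/nomic-embed-text-v1.5

# Request embeddings from the running server.
curl 127.0.0.1:8080/embed \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?"}'
```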

Supported re-rankers and sequence classification models

Text Embeddings Inference currently supports CamemBERT and XLM-RoBERTa Sequence Classification models with absolute positions.

Below are some examples of the currently supported models:

| Task | Model Type | Model ID | Revision |
|------|------------|----------|----------|
| Re-Ranking | XLM-RoBERTa | BAAI/bge-reranker-large | refs/pr/4 |
| Re-Ranking | XLM-RoBERTa | BAAI/bge-reranker-base | refs/pr/5 |
| Sentiment Analysis | RoBERTa | SamLowe/roberta-base-go_emotions | |
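
Re-rankers from the table can be served the same way. A sketch, assuming the same illustrative port and cache path; note the `--revision` flag for the refs/pr/… revisions listed above:

```shell
# Serve a re-ranker at the revision listed in the table.
docker run -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
    --model-id BAAI/bge-reranker-large --revision refs/pr/4

# Score candidate texts against a query via the /rerank route.
curl 127.0.0.1:8080/rerank \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}'
```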

Supported hardware

Text Embeddings Inference can be used on CPU, Turing (T4, RTX 2000 series, …), Ampere 80 (A100, A30), Ampere 86 (A10, A40, …), Ada Lovelace (RTX 4000 series, …), and Hopper (H100) architectures.

The library does not support CUDA compute capabilities below 7.5, so GPUs such as the V100, Titan V, and GTX 1000 series are not supported. To leverage your GPUs, make sure to install the NVIDIA Container Toolkit and use NVIDIA drivers with CUDA version 12.2 or higher.
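
To check whether a machine meets these requirements, `nvidia-smi` reports the driver's CUDA version, and on reasonably recent drivers can also report compute capability directly; a sketch:

```shell
# The driver's CUDA version appears in the header of the default output.
nvidia-smi

# On recent drivers, query compute capability directly;
# values below 7.5 are not supported by the library.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```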

Find the appropriate Docker image for your hardware in the following table:

| Architecture | Image |
|--------------|-------|
| CPU | ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 |
| Volta | NOT SUPPORTED |
| Turing (T4, RTX 2000 series, …) | ghcr.io/huggingface/text-embeddings-inference:turing-1.6 (experimental) |
| Ampere 80 (A100, A30) | ghcr.io/huggingface/text-embeddings-inference:1.6 |
| Ampere 86 (A10, A40, …) | ghcr.io/huggingface/text-embeddings-inference:86-1.6 |
| Ada Lovelace (RTX 4000 series, …) | ghcr.io/huggingface/text-embeddings-inference:89-1.6 |
| Hopper (H100) | ghcr.io/huggingface/text-embeddings-inference:hopper-1.6 (experimental) |

Warning: Flash Attention is turned off by default for the Turing image as it suffers from precision issues. You can turn Flash Attention v1 on by setting the USE_FLASH_ATTENTION=True environment variable.
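
Putting this together, a sketch of launching a GPU image; the model choices and host port are illustrative. The first command targets Ampere 80, the second shows opting into Flash Attention v1 on the Turing image per the warning above:

```shell
# Run on an A100/A30 with the Ampere 80 image
# (requires the NVIDIA Container Toolkit).
docker run --gpus all -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-embeddings-inference:1.6 \
    --model-id Alibaba-NLP/gte-large-en-v1.5

# On Turing, Flash Attention is off by default; opt in explicitly.
docker run --gpus all -p 8080:80 -v $PWD/data:/data \
    -e USE_FLASH_ATTENTION=True \
    ghcr.io/huggingface/text-embeddings-inference:turing-1.6 \
    --model-id WhereIsAI/UAE-Large-V1
```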
