The crispy sentence embedding family from Mixedbread.
🪆mxbai-embed-2d-large-v1🪆
This is our 2DMSE sentence embedding model. It supports adaptive transformer layers and adaptive embedding sizes. Find out more in our blog post.
TLDR: 2D-🪆 allows you to shrink both the model (its number of layers) and the embeddings. Shrinking only the embeddings yields results competitive with other models, such as Nomic's embedding model. Shrinking the model to ~50% maintains up to 85% of the performance without further training.
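To make the benefit of smaller embeddings concrete, here is a rough back-of-the-envelope sketch of storage cost per embedding size. It assumes float32 values (4 bytes each) and a full embedding size of 1024; neither figure is stated above, so treat both as assumptions. The helper name `embedding_storage_bytes` is hypothetical.

```python
def embedding_storage_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage needed for `num_vectors` dense vectors of `dims` values each.

    bytes_per_value=4 assumes float32 storage (an assumption, not a model property).
    """
    return num_vectors * dims * bytes_per_value


# 1024 is assumed to be the full embedding size; 768 and 512 are the
# truncated sizes used in the examples of this README.
for dims in (1024, 768, 512):
    size = embedding_storage_bytes(1_000_000, dims)
    print(f"{dims} dims: {size / 1e9:.2f} GB per million vectors")
```

The estimate ignores index overhead and metadata, so real storage will be somewhat larger.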
Quickstart
Here, we provide several ways to produce sentence embeddings with adaptive layers and embedding sizes. For this version, it is recommended to set adaptive layers from 20 to 24.
sentence-transformers
Currently, the best way to use our models is with the most recent version of sentence-transformers.
python -m pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
# 1. load model with `cls` pooling
model = SentenceTransformer("mixedbread-ai/mxbai-embed-2d-large-v1")
# 2. set adaptive layer and embedding size.
# it is recommended to set layers from 20 to 24.
new_num_layers = 22 # 1D: set layer size
model[0].auto_model.encoder.layer = model[0].auto_model.encoder.layer[:new_num_layers]
new_embedding_size = 768 # 2D: set embedding size
# 3. encode
embeddings = model.encode(
[
'Who is german and likes bread?',
'Everybody in Germany.'
]
)
# Similarity between the two sentences, using only the first `new_embedding_size` dimensions
similarities = cos_sim(embeddings[0, :new_embedding_size], embeddings[1, :new_embedding_size])
print('similarities:', similarities)
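The slicing step in the example above can be factored into a small helper. The following is a minimal sketch in plain Python, using toy vectors in place of real model output; the names `truncate_and_normalize` and `cosine_similarity` are hypothetical and not part of sentence-transformers. Re-normalizing after truncation is optional for cosine similarity, but it is useful if the vectors are later compared with a dot product.

```python
import math


def truncate_and_normalize(embedding, size):
    """Keep the first `size` dimensions and rescale the result to unit length."""
    if not 0 < size <= len(embedding):
        raise ValueError("size must be between 1 and len(embedding)")
    head = [float(x) for x in embedding[:size]]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero vector: nothing to rescale
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for real embeddings.
query = [0.5, 0.5, 0.1, -0.3]
doc = [0.4, 0.6, -0.2, 0.9]

q2 = truncate_and_normalize(query, 2)
d2 = truncate_and_normalize(doc, 2)
print("similarity on 2 dims:", cosine_similarity(q2, d2))
```

With real embeddings, `size` would be the chosen embedding size (768 in the example above) and the inputs would be rows of the array returned by `model.encode`.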
angle-emb
You can also use the latest angle-emb for inference, as follows:
python -m pip install -U angle-emb
from angle_emb import AnglE
from sentence_transformers.util import cos_sim
# 1. load model
model = AnglE.from_pretrained("mixedbread-ai/mxbai-embed-2d-large-v1", pooling_strategy='cls').cuda()
# 2. set adaptive layer and embedding size.
# it is recommended to set layers from 20 to 24.
layer_index = 22 # 1d: layer
embedding_size = 768 # 2d: embedding size
# 3. encode
embeddings = model.encode([
'Who is german and likes bread?',
'Everybody in Germany.'
], layer_index=layer_index, embedding_size=embedding_size)
similarities = cos_sim(embeddings[0], embeddings[1:])
print('similarities:', similarities)
Transformers.js
If you haven't already, you can install the Transformers.js JavaScript library from NPM using:
npm i @xenova/transformers
You can then use the model to compute embeddings as follows:
import { pipeline, cos_sim } from '@xenova/transformers';
// Create a feature-extraction pipeline
const extractor = await pipeline('feature-extraction', 'mixedbread-ai/mxbai-embed-2d-large-v1', {
quantized: false, // (Optional) remove this line to use the 8-bit quantized model
});
// Compute sentence embeddings (with `cls` pooling)
const sentences = ['Who is german and likes bread?', 'Everybody in Germany.' ];
const output = await extractor(sentences, { pooling: 'cls' });
// Set embedding size and truncate embeddings
const new_embedding_size = 768;
const truncated = output.slice(null, [0, new_embedding_size]);
// Compute cosine similarity
console.log(cos_sim(truncated[0].data, truncated[1].data)); // 0.6979532021425204
Using API
You can use the model via our API as follows:
from mixedbread_ai.client import MixedbreadAI
from sklearn.metrics.pairwise import cosine_similarity
import os
mxbai = MixedbreadAI(api_key="{MIXEDBREAD_API_KEY}")
english_sentences = [
'What is the capital of Australia?',
'Canberra is the capital of Australia.'
]
res = mxbai.embeddings(
input=english_sentences,
model="mixedbread-ai/mxbai-embed-2d-large-v1",
dimensions=512,
)
embeddings = [entry.embedding for entry in res.data]
similarities = cosine_similarity([embeddings[0]], [embeddings[1]])
print(similarities)
The API comes with native INT8 and binary quantization support! Check out the docs for more information.
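For intuition about what INT8 and binary quantization do to an embedding, here is a minimal local sketch in plain Python. It illustrates the general idea only and is not the API's implementation: the sign threshold for binary quantization, the value range `[lo, hi]` for INT8, and all function names are assumptions made for this example.

```python
def quantize_binary(embedding):
    """Map each dimension to one bit: 1 if the value is positive, else 0."""
    return [1 if x > 0 else 0 for x in embedding]


def hamming_distance(bits_a, bits_b):
    """Number of positions where two equal-length bit vectors differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("bit vectors must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


def quantize_int8(embedding, lo, hi):
    """Linearly map values from [lo, hi] to integers in [-128, 127].

    Values outside [lo, hi] are clipped. The range is an assumption; in
    practice it is usually estimated from a sample of embeddings.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scale = 255.0 / (hi - lo)
    quantized = []
    for x in embedding:
        clipped = min(max(x, lo), hi)
        quantized.append(int(round((clipped - lo) * scale)) - 128)
    return quantized


# Toy vectors standing in for real embeddings.
a = [0.12, -0.40, 0.03, -0.07]
b = [0.10, 0.25, -0.02, -0.09]

print("binary a:", quantize_binary(a))
print("binary b:", quantize_binary(b))
print("hamming distance:", hamming_distance(quantize_binary(a), quantize_binary(b)))
print("int8 a:", quantize_int8(a, lo=-0.5, hi=0.5))
```

Binary vectors are compared with Hamming distance instead of cosine similarity, which trades some accuracy for much smaller storage and faster comparison.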
Evaluation
Please find more information in our blog post.
Community
Please join our Discord Community and share your feedback and thoughts! We are here to help and always happy to chat.
License
Apache 2.0
Evaluation results
- accuracy on MTEB AmazonCounterfactualClassification (en) test set (self-reported): 74.761
- ap on MTEB AmazonCounterfactualClassification (en) test set (self-reported): 37.906
- f1 on MTEB AmazonCounterfactualClassification (en) test set (self-reported): 68.808
- accuracy on MTEB AmazonPolarityClassification test set (self-reported): 93.256
- ap on MTEB AmazonPolarityClassification test set (self-reported): 90.069
- f1 on MTEB AmazonPolarityClassification test set (self-reported): 93.248
- accuracy on MTEB AmazonReviewsClassification (en) test set (self-reported): 46.162
- f1 on MTEB AmazonReviewsClassification (en) test set (self-reported): 45.670
- map_at_1 on MTEB ArguAna test set (self-reported): 37.980
- map_at_10 on MTEB ArguAna test set (self-reported): 54.918