Sentence Transformers

university

https://www.SBERT.net

AI & ML interests

The models below are tuned for sentence and text embedding generation. They can be used with the sentence-transformers package.

sentence-transformers's activity

tomaarsen

authored a paper 3 days ago

MMTEB: Massive Multilingual Text Embedding Benchmark

Paper • 2502.13595 • Published 5 days ago • 28

tomaarsen

updated a collection 10 days ago

Embedding Model Datasets

Collection

A curated subset of the datasets that work out of the box with Sentence Transformers: https://huggingface.co/datasets?other=sentence-transformers • 68 items • Updated 10 days ago • 109

tomaarsen

updated a dataset 10 days ago

sentence-transformers/msmarco

Viewer • Updated 10 days ago • 526M • 122 • 1

tomaarsen

in sentence-transformers/all-MiniLM-L6-v2 12 days ago

Quantization training technique used for all miniLM L6 v2 quantized model.

#100 opened 21 days ago by

learnerX

tomaarsen

in sentence-transformers/all-MiniLM-L6-v2 17 days ago

Discrepancy in max tokens

#101 opened 17 days ago by

KennethEnevoldsen

tomaarsen

in sentence-transformers/gtr-t5-large 21 days ago

Clarification regarding dimensions for gtr-t5-large embedding model

#3 opened 21 days ago by

ksridhar-123

tomaarsen

in sentence-transformers/all-MiniLM-L6-v2 25 days ago

Unable to load sentence transformer ( was previously working)

#98 opened 25 days ago by

avifin19

tomaarsen

in sentence-transformers/all-MiniLM-L6-v2 about 1 month ago

Should I be concerned about "UserWarning: TypedStorage is deprecated. " when using the model with python 3.11?

#97 opened about 1 month ago by

poohlio

tomaarsen

posted an update about 1 month ago

I just released Sentence Transformers v3.4.0, featuring a memory leak fix, compatibility between the powerful Cached... losses and the Matryoshka loss modifier, and a bunch of fixes & small features.

🪆 Matryoshka & Cached loss compatibility
It is now possible to combine the powerful Cached... losses (which use in-batch negatives & a caching mechanism to allow for endless batch size & negatives) with the Matryoshka loss modifier. This modifier changes a base loss so that the model is trained not only on the maximum dimensionality (e.g. 1024 dimensions), but also on many lower dimensions (e.g. 768, 512, 256, 128, 64, 32).
After training, these models' embeddings can be truncated for faster retrieval, etc.
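
As a rough illustration of the idea, the sketch below applies a base loss to several truncated versions of the same embeddings and sums the results. It is a minimal pure-Python sketch, not the library's implementation: the function names, the cosine-similarity scale of 20, and the simple in-batch negatives loss used as the base loss are all assumptions made for the example, and the caching mechanism is not modelled.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Toy base loss: positives[i] is the match for anchors[i]; every other
    positive in the batch acts as a negative (cross-entropy over similarities)."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims, base_loss=in_batch_negatives_loss):
    """Hypothetical modifier: sum the base loss over several truncated sizes."""
    total = 0.0
    for dim in dims:
        truncated_anchors = [a[:dim] for a in anchors]
        truncated_positives = [p[:dim] for p in positives]
        total += base_loss(truncated_anchors, truncated_positives)
    return total


anchors = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
positives = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
loss = matryoshka_loss(anchors, positives, dims=[4, 2])
```

Because every prefix of the embedding is trained to be useful on its own, keeping only the first `dim` values at inference time (the same slicing used inside the loop) is what makes truncation cheap after training.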

🎞️ Resolve memory leak when Model and Trainer are reinitialized
Due to a circular dependency between Trainer -> Model -> ModelCardData -> Trainer, deleting both the trainer & model still didn't free up the memory.
This led to a memory leak in scripts that repeatedly reinitialize the model and trainer.
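
The effect of such a reference cycle can be reproduced with a few toy classes. The sketch below is an illustration only: the classes are hypothetical stand-ins that share names with the real ones, and breaking the cycle with a weak reference is one common approach, assumed here for the example rather than taken from the release.

```python
import gc
import weakref


class ModelCardData:
    def __init__(self):
        self.trainer = None  # back-reference that closes the cycle


class Model:
    def __init__(self):
        self.model_card_data = ModelCardData()


class Trainer:
    def __init__(self, model, use_weakref=False):
        self.model = model
        # Trainer -> Model -> ModelCardData -> Trainer
        back_reference = weakref.ref(self) if use_weakref else self
        model.model_card_data.trainer = back_reference


def model_survives_deletion(use_weakref):
    """Return True if the model is still in memory after `del trainer, model`."""
    gc_was_enabled = gc.isenabled()
    gc.disable()  # rely on reference counting only, so the result is deterministic
    try:
        model = Model()
        trainer = Trainer(model, use_weakref=use_weakref)
        probe = weakref.ref(model)  # lets us observe the model without keeping it alive
        del trainer, model
        return probe() is not None
    finally:
        gc.collect()  # clean up any leftover cycle
        if gc_was_enabled:
            gc.enable()


leaks_with_strong_cycle = model_survives_deletion(use_weakref=False)
leaks_with_weak_back_reference = model_survives_deletion(use_weakref=True)
```

With a strong back-reference the objects keep each other alive until the cyclic garbage collector runs; with a weak back-reference, reference counting alone frees them as soon as the names are deleted.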

➕ New Features
Many new small features, e.g. multi-GPU support for 'mine_hard_negatives', a 'margin' parameter to TripletEvaluator, and Matthews Correlation Coefficient in the BinaryClassificationEvaluator.
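
For reference, the Matthews Correlation Coefficient for binary labels can be computed from the confusion matrix as follows. This is a standalone sketch of the standard formula; the function name is hypothetical and it is not the evaluator's own code.

```python
import math


def matthews_corrcoef(labels, predictions):
    """MCC for binary labels (0 or 1); returns 0.0 when the score is undefined."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # e.g. all predictions fall into a single class
    return (tp * tn - fp * fn) / denominator


score = matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0])
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction correct), and unlike plain accuracy it stays informative when the two classes are imbalanced.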

🐛 Bug Fixes
Also a bunch of fixes, e.g. for subsequent batches not being sorted when using the "no_duplicates" batch sampler. See the release notes for more details.
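
To make the batch sampler mentioned above more concrete, here is a minimal sketch of the general idea, assuming that "no duplicates" means no text appears twice within a batch (which matters when other samples in the batch serve as negatives). The function name and the greedy strategy are assumptions for the example; this is not the library's implementation and it does not reproduce the bug that was fixed.

```python
def no_duplicate_batches(samples, batch_size):
    """Greedily group sample indices so that no text repeats within a batch.

    `samples` is a list of tuples of texts, e.g. (query, answer).
    Samples that would introduce a duplicate are deferred to a later batch.
    """
    remaining = list(range(len(samples)))
    batches = []
    while remaining:
        batch, seen_texts, deferred = [], set(), []
        for index in remaining:
            texts = set(samples[index])
            if len(batch) < batch_size and not (texts & seen_texts):
                batch.append(index)
                seen_texts |= texts
            else:
                deferred.append(index)
        batches.append(batch)
        remaining = deferred
    return batches


samples = [("q1", "a"), ("q1", "b"), ("q2", "c"), ("q3", "a")]
batches = no_duplicate_batches(samples, batch_size=2)
```

In this toy run, the second sample shares "q1" with the first and is therefore pushed to the next batch, where it is paired with the fourth sample.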

Full release notes: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.4.0

Big thanks to all community members who assisted in this release. 10 folks with their first contribution this time around!

tomaarsen

in sentence-transformers/all-MiniLM-L6-v2 about 1 month ago

Add openvino converted tokenizers

#96 opened about 1 month ago by

rhecker

tomaarsen

updated a Space about 1 month ago

README

❤

tomaarsen

in sentence-transformers/static-similarity-mrl-multilingual-v1 about 1 month ago

Upload ONNX weights

#1 opened about 1 month ago by

Xenova

tomaarsen

in sentence-transformers/static-retrieval-mrl-en-v1 about 1 month ago

Upload ONNX weights

#2 opened about 1 month ago by

Xenova

Quants?

#1 opened about 1 month ago by

ctranslate2-4you

tomaarsen

updated 2 models about 1 month ago

sentence-transformers/static-retrieval-mrl-en-v1

sentence-transformers/static-similarity-mrl-multilingual-v1

tomaarsen

posted an update about 1 month ago

🏎️ Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality! Including 2 fully open models: training scripts, datasets, metrics.

We apply our recipe to train 2 Static Embedding models that we release today! We release:
2️⃣ an English Retrieval model and a general-purpose Multilingual similarity model (e.g. classification, clustering, etc.), both Apache 2.0
🧠 my modern training strategy: ideation -> dataset choice -> implementation -> evaluation
📜 my training scripts, using the Sentence Transformers library
📊 my Weights & Biases reports with losses & metrics
📕 my list of 30 training and 13 evaluation datasets

The 2 Static Embedding models have the following properties:
🏎️ Extremely fast, e.g. 107500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5'
0️⃣ Zero active parameters: No Transformer blocks, no attention, not even a matrix multiplication. Super speed!
📏 No maximum sequence length! Embed texts at any length (note: longer texts may embed worse)
📐 Linear instead of superlinear complexity: 2x longer text takes 2x longer, instead of 2.5x or more.
🪆 Matryoshka support: allows you to truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% perf. decrease for English Similarity tasks)
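
Several of these properties follow from how a static embedding model computes a sentence vector. The sketch below assumes the model is a per-token lookup table followed by mean pooling; the tiny table, the whitespace tokenizer, and the function name are hypothetical and only serve to show the mechanism.

```python
# Hypothetical 2-dimensional token table; real models store one vector per vocabulary token.
TOY_TABLE = {
    "fast": [1.0, 0.0],
    "cpu": [0.0, 1.0],
    "model": [1.0, 1.0],
}


def embed(text, table=TOY_TABLE):
    """Average the vectors of all known tokens in `text`.

    One lookup and one addition per token: the cost grows linearly with
    the text length and there is no maximum sequence length.
    """
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    count = 0
    for token in text.lower().split():  # whitespace tokenizer, an assumption
        vector = table.get(token)
        if vector is None:
            continue  # unknown tokens are skipped in this sketch
        for i, value in enumerate(vector):
            total[i] += value
        count += 1
    if count == 0:
        return total
    return [value / count for value in total]


short_embedding = embed("fast cpu")
longer_embedding = embed("fast model fast model")
```

Since every token contributes equally to the average, very long texts blur into a generic vector, which is consistent with the note that longer texts may embed worse.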

Check out the full blogpost if you'd like to 1) use these lightning-fast models or 2) learn how to train them with consumer-level hardware: https://huggingface.co/blog/static-embeddings

The blogpost contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.

Alternatively, check out the models:
* sentence-transformers/static-retrieval-mrl-en-v1
* sentence-transformers/static-similarity-mrl-multilingual-v1