arxiv:2402.05672

Multilingual E5 Text Embeddings: A Technical Report

Published on Feb 8
· Submitted by akhaliq on Feb 9

Abstract

This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model whose performance is on par with state-of-the-art English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5.
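
For context on how the released checkpoints are used, below is a minimal retrieval sketch. It is not taken from the report itself: the model id intfloat/multilingual-e5-small and the "query: " / "passage: " input prefixes are assumptions based on the standard E5 usage convention from the model cards.

```python
# Minimal sketch (assumptions: the checkpoints are published on the Hugging Face Hub as
# intfloat/multilingual-e5-{small,base,large} and expect "query: " / "passage: " prefixes).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer


def average_pool(last_hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Mean-pool token embeddings, ignoring padded positions.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


input_texts = [
    "query: how much protein should a female eat",
    "query: 南瓜的家常做法",
    "passage: As a general guideline, the average protein requirement for adult women is about 46 grams per day.",
    "passage: 清炒南瓜丝是一道简单的家常菜。",
]

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-small")

batch = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)

# Embeddings are L2-normalized, so the dot product equals cosine similarity.
scores = embeddings[:2] @ embeddings[2:].T
print(scores)
```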

Community

@intfloat Thank you for your work!
When training "mE5-large-instruct", did you use only the synthetic data, synthetic + msmarco, or synthetic + full data? (I am referring to the notation introduced in "Improving Text Embeddings with Large Language Models".)

It is the "synthetic data + full data" setting, the same data mixture as the released e5-mistral-7b-instruct model.
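
To make the instruct-model discussion concrete, here is a small sketch of the instruction-formatted query template. It assumes mE5-large-instruct is released as intfloat/multilingual-e5-large-instruct, is loadable with sentence-transformers, and uses the same "Instruct: ...\nQuery: ..." template as e5-mistral-7b-instruct; only the shared data mixture is confirmed in the reply above.

```python
# Sketch of instruction-formatted retrieval (assumed model id and template, see note above).
from sentence_transformers import SentenceTransformer


def get_detailed_instruct(task_description: str, query: str) -> str:
    # The task description is prepended to queries only; documents are embedded as-is.
    return f"Instruct: {task_description}\nQuery: {query}"


model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [get_detailed_instruct(task, "how much protein should a female eat")]
documents = ["As a general guideline, the average protein requirement for adult women is about 46 grams per day."]

query_emb = model.encode(queries, normalize_embeddings=True)
doc_emb = model.encode(documents, normalize_embeddings=True)

# Normalized embeddings: dot product equals cosine similarity.
print(query_emb @ doc_emb.T)
```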

Models citing this paper: 17

Datasets citing this paper: 0

Spaces citing this paper: 158

Collections including this paper: 9