Multilingual E5 Text Embeddings: A Technical Report
Abstract
This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5 .
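For readers who want to try the released checkpoints, the following is a minimal usage sketch (not taken from the report itself). It assumes the Hugging Face model ID intfloat/multilingual-e5-base and follows the conventions described in the model cards: "query: " / "passage: " input prefixes, average pooling over the last hidden states, and L2 normalization.

```python
# Hedged sketch: embedding a query and a passage with a multilingual E5 checkpoint.
# Model ID and input-prefix convention are assumptions based on the public model cards.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "intfloat/multilingual-e5-base"  # also available: -small / -large
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

texts = [
    "query: how much protein should a female eat",
    "passage: As a general guideline, the CDC's average requirement of protein "
    "for women ages 19 to 70 is 46 grams per day.",
]

batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

# Average pooling over non-padding tokens, then L2 normalization.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity between the query and the passage embeddings.
score = embeddings[0] @ embeddings[1]
print(score.item())
```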
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Improving Text Embeddings with Large Language Models (2023)
- Nomic Embed: Training a Reproducible Long Context Text Embedder (2024)
- BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation (2024)
- JaColBERT and Hard Negatives, Towards Better Japanese-First Embeddings for Retrieval: Early Technical Report (2023)
- Adapting Large Language Models for Document-Level Machine Translation (2024)
@intfloat
Thank you for your work!
When training "mE5-large-instruct", did you use only the synthetic data, synthetic + msmarco, or synthetic + full data? (I am referring to the notation introduced in "Improving Text Embeddings with Large Language Models".)
It is the "synthetic data + full data" setting, the same data mixture as the released e5-mistral-7b-instruct model.
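Since the thread concerns the instruction-tuned variant, here is a hedged sketch of how queries are formatted for it. The one-line task instruction template below follows the public model cards for e5-mistral-7b-instruct and intfloat/multilingual-e5-large-instruct (an assumption, not stated in this thread); passages are embedded without any instruction.

```python
# Hedged sketch: task-prefixed query formatting for instruction-tuned E5 models.
# Template and task wording are assumptions drawn from the released model cards.
def get_detailed_instruct(task_description: str, query: str) -> str:
    # Queries carry a natural-language task description; passages do not.
    return f"Instruct: {task_description}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [get_detailed_instruct(task, "how much protein should a female eat")]
passages = ["As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day."]
# Encode `queries` and `passages` with the same pooling and normalization as in the sketch above.
```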