pipeline_tag: sentence-similarity
datasets:
  - ms_marco
  - sentence-transformers/msmarco-hard-negatives
metrics:
  - recall
tags:
  - passage-retrieval
library_name: sentence-transformers
base_model: facebook/xmod-base
inference: false
language:
  - multilingual
  - af
  - am
  - ar
  - az
  - be
  - bg
  - bn
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fr
  - ga
  - gl
  - gu
  - ha
  - he
  - hi
  - hr
  - hu
  - hy
  - id
  - is
  - it
  - ja
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lo
  - lt
  - lv
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - or
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - sa
  - si
  - sk
  - sl
  - so
  - sq
  - sr
  - sv
  - sw
  - ta
  - te
  - th
  - tl
  - tr
  - uk
  - ur
  - uz
  - vi
  - zh

DPR-XM

🛠️ Usage | 📊 Evaluation | 🤖 Training | 🔗 Citation |

💻 Code | 📄 Paper

This is a multilingual dense single-vector bi-encoder model. It maps questions and paragraphs to 768-dimensional dense vectors and can be used for semantic search. The model uses an XMOD backbone, which allows it to learn from monolingual fine-tuning in a high-resource language, such as English, and perform zero-shot retrieval across multiple languages.

Usage

Here are some examples of using DPR-XM with Sentence-Transformers, FlagEmbedding, or Hugging Face Transformers.

Using Sentence-Transformers

Start by installing the library: pip install -U sentence-transformers. Then, you can use the model like this:

from sentence_transformers import SentenceTransformer

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]
language_code = "fr_FR" #Find all codes here: https://huggingface.co/facebook/xmod-base#languages 

model = SentenceTransformer('antoinelouis/dpr-xm')
model[0].auto_model.set_default_language(language_code) #Activate the language-specific adapters

q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

similarity = q_embeddings @ p_embeddings.T
print(similarity)

Using FlagEmbedding

Start by installing the library: pip install -U FlagEmbedding. Then, you can use the model like this:

from FlagEmbedding import FlagModel

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]
language_code = "fr_FR" #Find all codes here: https://huggingface.co/facebook/xmod-base#languages 

model = FlagModel('antoinelouis/dpr-xm')
model.model.set_default_language(language_code) #Activate the language-specific adapters

q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

similarity = q_embeddings @ p_embeddings.T
print(similarity)

Using Transformers

Start by installing the library: pip install -U transformers. Then, you can use the model like this:

import torch
from torch.nn.functional import normalize
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    """Mean-pool the contextualized token embeddings, ignoring padding tokens in the mean computation."""
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]
language_code = "fr_FR"  # Find all codes here: https://huggingface.co/facebook/xmod-base#languages

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/dpr-xm')
model = AutoModel.from_pretrained('antoinelouis/dpr-xm')
model.set_default_language(language_code)  # Activate the language-specific adapters

q_input = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
p_input = tokenizer(passages, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    q_output = model(**q_input)
    p_output = model(**p_input)
# Pool token embeddings into one vector per text, then L2-normalize so the dot product is a cosine similarity
q_embeddings = normalize(mean_pooling(q_output, q_input['attention_mask']), p=2, dim=1)
p_embeddings = normalize(mean_pooling(p_output, p_input['attention_mask']), p=2, dim=1)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
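
For semantic search, the similarity matrix computed in the examples above is typically turned into a ranked list of passages per query. The following is a minimal sketch of that step, not part of the model's API: the helper name `top_k` is hypothetical, and it assumes the matrix has been converted to nested Python lists (e.g. with `similarity.tolist()`, which works for both NumPy arrays and PyTorch tensors).

```python
def top_k(similarity_rows, passages, k=1):
    """Return, for each query, the k highest-scoring (passage, score) pairs.

    similarity_rows[i][j] is the similarity between query i and passage j.
    """
    results = []
    for row in similarity_rows:
        # Sort passage indices by descending similarity and keep the first k
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        results.append([(passages[j], row[j]) for j in order])
    return results


# Toy scores for two queries and two passages (illustrative values, not model output)
example = top_k([[0.2, 0.9], [0.7, 0.1]], ["passage A", "passage B"], k=1)
print(example)
```

For a large corpus, passage embeddings would normally be computed once and stored in a vector index rather than compared exhaustively as done here.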

Evaluation

  • mMARCO: We evaluate our model on the small development sets of mMARCO, which consist of 6,980 queries for a corpus of 8.8M candidate passages in 14 languages. Below, we compare its multilingual performance with other retrieval models on the dataset's official metric, i.e., mean reciprocal rank at cut-off 10 (MRR@10).

| | Model | Type | #Samples | #Params | en | es | fr | it | pt | id | de | ru | zh | ja | nl | vi | hi | ar | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BM25 (Pyserini) | lexical | - | - | 18.4 | 15.8 | 15.5 | 15.3 | 15.2 | 14.9 | 13.6 | 12.4 | 11.6 | 14.1 | 14.0 | 13.6 | 13.4 | 11.1 | 14.2 |
| 2 | mono-mT5 (Bonifacio et al., 2021) | cross-encoder | 12.8M | 390M | 36.6 | 31.4 | 30.2 | 30.3 | 30.2 | 29.8 | 28.9 | 26.3 | 24.9 | 26.7 | 29.2 | 25.6 | 26.6 | 23.5 | 28.6 |
| 3 | mono-mMiniLM (Bonifacio et al., 2021) | cross-encoder | 80.0M | 107M | 36.6 | 30.9 | 29.6 | 29.1 | 28.9 | 29.3 | 27.8 | 25.1 | 24.9 | 26.3 | 27.6 | 24.7 | 26.2 | 21.9 | 27.8 |
| 4 | DPR-X (Yang et al., 2022) | single-vector | 25.6M | 550M | 24.5 | 19.6 | 18.9 | 18.3 | 19.0 | 16.9 | 18.2 | 17.7 | 14.8 | 15.4 | 18.5 | 15.1 | 15.4 | 12.9 | 17.5 |
| 5 | mE5-base (Wang et al., 2024) | single-vector | 5.1B | 278M | 35.0 | 28.9 | 30.3 | 28.0 | 27.5 | 26.1 | 27.1 | 24.5 | 22.9 | 25.0 | 27.3 | 23.9 | 24.2 | 20.5 | 26.5 |
| 6 | mColBERT (Bonifacio et al., 2021) | multi-vector | 25.6M | 180M | 35.2 | 30.1 | 28.9 | 29.2 | 29.2 | 27.5 | 28.1 | 25.0 | 24.6 | 23.6 | 27.3 | 18.0 | 23.2 | 20.9 | 26.5 |
| 7 | DPR-XM (ours) | single-vector | 25.6M | 277M | 32.7 | 23.6 | 23.5 | 22.3 | 22.7 | 22.0 | 22.1 | 19.9 | 18.1 | 18.7 | 22.9 | 18.0 | 16.0 | 15.1 | 21.3 |
| 8 | ColBERT-XM (ours) | multi-vector | 6.4M | 277M | 37.2 | 28.5 | 26.9 | 26.5 | 27.6 | 26.3 | 27.0 | 25.1 | 24.6 | 24.1 | 27.5 | 22.6 | 23.8 | 19.5 | 26.2 |
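
For readers unfamiliar with the metric, MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage found within the top 10 results (0 if none is found). Below is a minimal sketch of that computation, not the official mMARCO evaluation script; the function name and the dictionary layout are assumptions made for illustration. The scores in the table appear to be reported on a 0-100 scale, whereas this sketch returns a value in [0, 1].

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank at cut-off k.

    rankings: dict mapping a query id to its ranked list of passage ids (best first).
    relevant: dict mapping a query id to the set of relevant passage ids.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        # Only the first relevant passage within the top k contributes
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in gold:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Toy example with hypothetical ids: first hit at rank 2, at rank 1, and no hit
rankings = {"q1": ["p3", "p1"], "q2": ["p9", "p8"], "q3": ["p0"]}
relevant = {"q1": {"p1"}, "q2": {"p9"}, "q3": {"p5"}}
print(mrr_at_k(rankings, relevant))  # (1/2 + 1 + 0) / 3 = 0.5
```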

Training

Data

We use the English training samples from the MS MARCO passage ranking dataset, which contains 8.8M passages and 539K training queries. We do not employ the BM25 negatives provided by the official dataset but instead sample harder negatives mined from 12 distinct dense retrievers, using the msmarco-hard-negatives distillation dataset. Our final training set consists of 25.6M (q, p+, p-) triples.
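
To make the triple construction concrete, here is a rough sketch of how (q, p+, p-) triples could be drawn from mined hard negatives. It is an illustration under stated assumptions, not the authors' sampling procedure: the record layout (`qid`, `pos`, and a `neg` dictionary keyed by retriever name), the function name, and uniform sampling over the pooled negatives are all hypothetical choices.

```python
import random


def build_triples(record, negatives_per_positive=1, seed=0):
    """Build (query id, positive id, negative id) triples from one record.

    Assumed record layout (hypothetical):
      {"qid": ..., "pos": [pid, ...], "neg": {"retriever-name": [pid, ...], ...}}
    """
    rng = random.Random(seed)
    positives = set(record["pos"])
    # Pool negatives from all retrievers, dropping duplicates and any known positives
    pool = sorted({pid for pids in record["neg"].values() for pid in pids} - positives)
    triples = []
    for pos in record["pos"]:
        # Sample without replacement so one positive never gets the same negative twice
        for neg in rng.sample(pool, min(negatives_per_positive, len(pool))):
            triples.append((record["qid"], pos, neg))
    return triples


# Toy record with made-up ids and retriever names
record = {"qid": 1, "pos": [10], "neg": {"retriever-a": [20, 30], "retriever-b": [30, 40, 10]}}
print(build_triples(record, negatives_per_positive=2))
```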

Implementation

The model is initialized from the xmod-base checkpoint and optimized via the in-batch sampled softmax cross-entropy loss (as in DPR). It is fine-tuned on one 32GB NVIDIA V100 GPU for 200k steps using the AdamW optimizer with a batch size of 128 and a peak learning rate of 2e-5, with warm-up over the first 10% of training steps followed by linear scheduling. We set the maximum sequence length for both questions and passages to 128 tokens.
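
The two ingredients named above can be sketched in a few lines of framework-free Python. This is a simplified illustration, not the training code: the function names are hypothetical, the loss takes a precomputed query-passage score matrix (so it makes no assumption about the similarity function or any temperature), and "linear scheduling" is assumed to mean linear decay to zero after warm-up.

```python
import math


def in_batch_softmax_loss(scores):
    """Mean cross-entropy where the positive for query i sits in column i.

    scores[i][j] is the similarity between query i and passage j. All other
    columns, including any appended hard negatives, act as negatives.
    """
    losses = []
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        losses.append(log_z - row[i])
    return sum(losses) / len(losses)


def linear_warmup_decay_lr(step, total_steps=200_000, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear warm-up, then linear decay to zero (assumed)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


print(in_batch_softmax_loss([[2.0, 0.0], [0.0, 2.0]]))
print(linear_warmup_decay_lr(20_000))  # peak is reached at the end of warm-up
```

With these settings, 200,000 steps at 128 examples per batch amount to 25.6M examples, which matches the reported size of the training set.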


Citation

@article{louis2024modular,
  author = {Louis, Antoine and Saxena, Vageesh and van Dijck, Gijs and Spanakis, Gerasimos},
  title = {ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval},
  journal = {CoRR},
  volume = {abs/2402.15059},
  year = {2024},
  url = {https://arxiv.org/abs/2402.15059},
  doi = {10.48550/arXiv.2402.15059},
  eprinttype = {arXiv},
  eprint = {2402.15059},
}