---
language:
- en
- de
- es
- fr
tags:
- clir
- colbertx
- plaidx
- xlm-roberta-large
datasets:
- ms_marco
- hltcoe/tdist-msmarco-scores
task_categories:
- text-retrieval
- information-retrieval
task_ids:
- passage-retrieval
- cross-language-retrieval
license: mit
---

# ColBERT-X for English-German/Spanish/French MLIR using Multilingual Translate-Distill

## MLIR Model Setting

- Query language: English
- Query length: 32 tokens max
- Document language: German/Spanish/French
- Document length: 180 tokens max (please use MaxP to aggregate passage scores if needed)
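
As a rough illustration of MaxP aggregation, the sketch below splits a long document into windows of at most 180 tokens and takes the highest passage score as the document score. The helper names, the stride of 90, and the use of a plain token list are assumptions made for this example; in practice the windows would be built with the model's tokenizer and scored by the retrieval model.

```python
from typing import List, Sequence


def split_into_passages(
    tokens: Sequence[str], max_len: int = 180, stride: int = 90
) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most max_len tokens.

    The stride value is an illustrative assumption, not a PLAID-X default.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    passages = []
    start = 0
    while True:
        passages.append(list(tokens[start:start + max_len]))
        # Stop once the current window reaches the end of the document.
        if start + max_len >= len(tokens):
            break
        start += stride
    return passages


def maxp_score(passage_scores: Sequence[float]) -> float:
    """MaxP: the document score is the maximum of its passage scores."""
    if not passage_scores:
        raise ValueError("a document needs at least one passage score")
    return max(passage_scores)
```

Each passage returned by `split_into_passages` would be scored against the query separately, and `maxp_score` then reduces those scores to a single document score.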

## Model Description

Multilingual Translate-Distill is a training technique that produces state-of-the-art MLIR dense retrieval models through translation and distillation.
`plaidx-large-clef-mtd-mix-entries-mt5xxl-engeng` is trained with a KL-divergence loss against the `mt5xxl` MonoT5 reranker
[`unicamp-dl/mt5-13b-mmarco-100k`](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k),
which was run on English MS MARCO training queries and passages.
The teacher scores can be found in
[`hltcoe/tdist-msmarco-scores`](https://huggingface.co/datasets/hltcoe/tdist-msmarco-scores/blob/main/t53b-monot5-msmarco-engeng.jsonl.gz).
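
To make the distillation objective concrete, here is a minimal sketch of a KL-divergence loss between teacher and student score distributions over the passages sampled for one query. It assumes both score lists are turned into distributions with a softmax and that the divergence is taken as KL(teacher || student); the function names are hypothetical and the actual PLAID-X training code may differ in these details.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distillation_loss(
    teacher_scores: Sequence[float], student_scores: Sequence[float]
) -> float:
    """KL(teacher || student) over the passages of a single query.

    The direction of the divergence is an assumption of this sketch.
    """
    if len(teacher_scores) != len(student_scores):
        raise ValueError("teacher and student must score the same passages")
    t_log = log_softmax(teacher_scores)
    s_log = log_softmax(student_scores)
    # sum_i p_teacher(i) * (log p_teacher(i) - log p_student(i))
    return sum(math.exp(t) * (t - s) for t, s in zip(t_log, s_log))
```

The loss is zero when the student ranks and spaces the passages exactly as the teacher does, and it grows as the two distributions diverge.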

### Training Parameters

- learning rate: 5e-6
- update steps: 200,000
- nway (number of passages per query): 6 (randomly selected from 50; 2 if using `round-robin-entires`, see below)
- per device batch size (number of query-passage sets): 8
- training GPU: 8 NVIDIA V100 with 32 GB memory

### Mixing Strategies

- `mix-passages`: languages are randomly assigned to the 6 sampled passages for a given query during training.
- `mix-entries`: all passages in a given query-passage set are randomly assigned to the same language.
- `round-robin-entires`: for each query, the query-passage set is repeated `n` times to iterate through all languages.
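
The three strategies can be sketched as language-assignment functions. This is an illustrative reading of the descriptions above, not the PLAID-X implementation: the function names, the `LANGUAGES` tuple, and the `rng` parameter are hypothetical, and each function only returns the language code that every sampled passage would be drawn from.

```python
import random
from typing import List, Optional, Sequence

# Document languages of this model; the tuple itself is an assumption of the sketch.
LANGUAGES = ("de", "es", "fr")


def mix_passages(
    n_passages: int,
    languages: Sequence[str] = LANGUAGES,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """mix-passages: each passage independently gets a random language."""
    if rng is None:
        rng = random.Random()
    return [rng.choice(languages) for _ in range(n_passages)]


def mix_entries(
    n_passages: int,
    languages: Sequence[str] = LANGUAGES,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """mix-entries: one random language shared by the whole query-passage set."""
    if rng is None:
        rng = random.Random()
    lang = rng.choice(languages)
    return [lang] * n_passages


def round_robin_entries(
    n_passages: int, languages: Sequence[str] = LANGUAGES
) -> List[List[str]]:
    """round-robin: repeat the set once per language, covering all languages."""
    return [[lang] * n_passages for lang in languages]
```

For example, `round_robin_entries(2)` yields three copies of a two-passage set, one per language, which matches the smaller nway of 2 listed for `round-robin-entires` in the training parameters.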

## Usage

To properly load ColBERT-X models from the Hugging Face Hub, please use PLAID-X 0.3.1 or later.
```bash
pip install "PLAID-X>=0.3.1"
```

The following code snippet loads the model through the Hugging Face API.
```python
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

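# Download (or reuse the cached copy of) the model and wrap it in a PLAID-X checkpoint.
checkpoint = Checkpoint('hltcoe/plaidx-large-clef-mtd-mix-entries-mt5xxl-engeng', colbert_config=ColBERTConfig())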
```

For a full tutorial, please refer to the [PLAID-X Jupyter Notebook](https://colab.research.google.com/github/hltcoe/clir-tutorial/blob/main/notebooks/clir_tutorial_plaidx.ipynb),
which is part of the [SIGIR 2023 CLIR Tutorial](https://github.com/hltcoe/clir-tutorial).

## BibTeX entry and Citation Info

Please cite the following two papers if you use the model.


```bibtex
@inproceedings{mtt,
	title = {Neural Approaches to Multilingual Information Retrieval},
	author = {Dawn Lawrie and Eugene Yang and Douglas W Oard and James Mayfield},
	booktitle = {Proceedings of the 45th European Conference on Information Retrieval (ECIR)},
	year = {2023},
	doi = {10.1007/978-3-031-28244-7_33},
	url = {https://arxiv.org/abs/2209.01335}
}
```

```bibtex
@inproceedings{mtd,
	author = {Eugene Yang and Dawn Lawrie and James Mayfield},
	title = {Distillation for Multilingual Information Retrieval},
	booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (Short Paper) (Accepted)},
	year = {2024},
	url = {https://arxiv.org/abs/2405.00977}
}
```