	---
	language:
	- en
	- de
	- es
	- fr
	tags:
	- clir
	- colbertx
	- plaidx
	- xlm-roberta-large
	datasets:
	- ms_marco
	- hltcoe/tdist-msmarco-scores
	task_categories:
	- text-retrieval
	- information-retrieval
	task_ids:
	- passage-retrieval
	- cross-language-retrieval
	license: mit
	---

	# ColBERT-X for English-German/Spanish/French MLIR using Multilingual Translate-Distill

	## MLIR Model Setting

	- Query language: English
	- Query length: 32 tokens max
	- Document language: German/Spanish/French
	- Document length: 180 tokens max (please use MaxP to aggregate passage scores if needed)
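The MaxP aggregation mentioned above can be sketched as follows. This is a minimal illustration, not the PLAID-X implementation: the helper names `split_into_passages` and `maxp_score` are hypothetical, the non-overlapping windows are an assumption, and a plain list of strings stands in for the model's own tokenizer output.

```python
from typing import List


def split_into_passages(tokens: List[str], max_len: int = 180) -> List[List[str]]:
    """Split a token list into consecutive passages of at most max_len tokens.

    Assumption: non-overlapping windows; a real pipeline may use overlap.
    """
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def maxp_score(passage_scores: List[float]) -> float:
    """MaxP: a document's score is the highest score among its passages."""
    if not passage_scores:
        raise ValueError("a document needs at least one passage score")
    return max(passage_scores)
```

A 400-token document would be split into passages of 180, 180, and 40 tokens; each passage is scored by the retrieval model, and `maxp_score` keeps the best of those scores as the document score.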

	## Model Description

	Multilingual Translate-Distill is a training technique that produces state-of-the-art MLIR dense retrieval models through translation and distillation.
	`plaidx-large-clef-mtd-mix-entries-mt5xxl-engeng` is trained with a KL-divergence loss against scores from the `mt5xxl` MonoT5 reranker
	[`unicamp-dl/mt5-13b-mmarco-100k`](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k),
	which was run on English MS MARCO training queries and passages.
	The teacher scores can be found in
	[`hltcoe/tdist-msmarco-scores`](https://huggingface.co/datasets/hltcoe/tdist-msmarco-scores/blob/main/t53b-monot5-msmarco-engeng.jsonl.gz).
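To make the distillation objective concrete, here is a minimal sketch of a KL-divergence loss between teacher and student scores for the passages of a single query. It assumes both score lists are softmax-normalized over the passages and that the divergence is taken as KL(teacher || student); the actual training code may differ in direction, temperature, or reduction, and the function names are hypothetical.

```python
import math
from typing import List


def log_softmax(scores: List[float]) -> List[float]:
    """Numerically stable log-softmax over one query's passage scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distill_loss(teacher_scores: List[float], student_scores: List[float]) -> float:
    """KL(teacher || student) over the passages sampled for one query."""
    if len(teacher_scores) != len(student_scores):
        raise ValueError("teacher and student must score the same passages")
    log_p = log_softmax(teacher_scores)  # teacher distribution (target)
    log_q = log_softmax(student_scores)  # student distribution (trained model)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The loss is zero when the student ranks and spaces the passages exactly as the teacher does (up to a constant shift in scores) and grows as the two distributions diverge.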

	### Training Parameters

	- learning rate: 5e-6
	- update steps: 200,000
	- nway (number of passages per query): 6 (randomly selected from 50; 2 if using `round-robin-entires`, see below)
	- per device batch size (number of query-passage sets): 8
	- training GPUs: 8 NVIDIA V100 with 32 GB memory each
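As a rough back-of-the-envelope check, the parameters above imply the following training volume. This assumes plain data parallelism across all 8 GPUs with no gradient accumulation, which the model card does not state; the variable names are illustrative only.

```python
# Values taken from the training parameters listed above.
num_gpus = 8
per_device_batch = 8   # query-passage sets per GPU
nway = 6               # passages per query
update_steps = 200_000

# Assumption: one optimizer step consumes one batch from every GPU.
sets_per_step = num_gpus * per_device_batch        # query-passage sets per update
passages_per_step = sets_per_step * nway           # passages scored per update
total_sets = sets_per_step * update_steps          # sets seen over the whole run

print(sets_per_step, passages_per_step, total_sets)
```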

	### Mixing Strategies

	- `mix-passages`: languages are randomly assigned to the 6 sampled passages for a given query during training.
	- `mix-entries`: all passages in a given query-passage set are randomly assigned to the same language.
	- `round-robin-entires`: for each query, the query-passage set is repeated `n` times, once per document language, to iterate through all languages.
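The three strategies differ only in how a language is assigned to each passage in a training entry. The sketch below illustrates that assignment; it is not the training code. The function names are hypothetical, the language codes come from the model card's front matter, and uniform random sampling is an assumption.

```python
import random
from typing import List

# Document languages of this model (German, Spanish, French).
LANGUAGES = ["de", "es", "fr"]


def mix_passages(nway: int, rng: random.Random) -> List[List[str]]:
    """One entry per query; every passage draws its own random language."""
    return [[rng.choice(LANGUAGES) for _ in range(nway)]]


def mix_entries(nway: int, rng: random.Random) -> List[List[str]]:
    """One entry per query; a single random language is shared by all passages."""
    lang = rng.choice(LANGUAGES)
    return [[lang] * nway]


def round_robin_entries(nway: int) -> List[List[str]]:
    """One entry per language, so the query-passage set appears len(LANGUAGES) times."""
    return [[lang] * nway for lang in LANGUAGES]
```

For example, `round_robin_entries(2)` yields three entries, one each for German, Spanish, and French, which matches the smaller `nway` of 2 noted for that strategy in the training parameters.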

	## Usage

	To properly load ColBERT-X models from the Hugging Face Hub, install PLAID-X version 0.3.1 or later.
	```bash
	pip install "PLAID-X>=0.3.1"
	```

	The following code snippet loads the model through the Hugging Face API.
	```python
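	from colbert.modeling.checkpoint import Checkpoint
	from colbert.infra import ColBERTConfig

	# Download the weights from the Hugging Face Hub and build the model with a default config.
	checkpoint = Checkpoint('hltcoe/plaidx-large-clef-mtd-mix-entries-mt5xxl-engeng', colbert_config=ColBERTConfig())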
	```

	For a full tutorial, please refer to the [PLAID-X Jupyter Notebook](https://colab.research.google.com/github/hltcoe/clir-tutorial/blob/main/notebooks/clir_tutorial_plaidx.ipynb),
	which is part of the [SIGIR 2023 CLIR Tutorial](https://github.com/hltcoe/clir-tutorial).

	## BibTeX entry and Citation Info

	Please cite the following two papers if you use the model.


	```bibtex
	@inproceedings{mtt,
	title = {Neural Approaches to Multilingual Information Retrieval},
	author = {Dawn Lawrie and Eugene Yang and Douglas W Oard and James Mayfield},
	booktitle = {Proceedings of the 45th European Conference on Information Retrieval (ECIR)},
	year = {2023},
	doi = {10.1007/978-3-031-28244-7_33},
	url = {https://arxiv.org/abs/2209.01335}
	}
	```

	```bibtex
	@inproceedings{mtd,
	author = {Eugene Yang and Dawn Lawrie and James Mayfield},
	title = {Distillation for Multilingual Information Retrieval},
	booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (Short Paper) (Accepted)},
	year = {2024},
	url = {https://arxiv.org/abs/2405.00977}
	}
	```