antoinelouis committed on
Commit
48e74b7
1 Parent(s): b22343e

Create README.md

Files changed (1)
  1. README.md +139 -0
README.md ADDED
@@ -0,0 +1,139 @@
---
pipeline_tag: text-classification
language: fr
license: mit
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- passage-reranking
library_name: sentence-transformers
base_model: google/mt5-small
model-index:
- name: crossencoder-mt5-small-mmarcoFR
  results:
  - task:
      type: text-classification
      name: Passage Reranking
    dataset:
      type: unicamp-dl/mmarco
      name: mMARCO-fr
      config: french
      split: validation
    metrics:
    - type: recall_at_500
      name: Recall@500
      value: 94.54
    - type: recall_at_100
      name: Recall@100
      value: 79.98
    - type: recall_at_10
      name: Recall@10
      value: 51.12
    - type: mrr_at_10
      name: MRR@10
      value: 28.00
---

# crossencoder-mt5-small-mmarcoFR

This is a cross-encoder model for French. It performs cross-attention over a question-passage pair and outputs a relevance score.
The model should be used as a reranker for semantic search: given a query and a set of potentially relevant passages retrieved by an efficient first-stage
retrieval system (e.g., BM25 or a fine-tuned dense single-vector bi-encoder), encode each query-passage pair and sort the passages in decreasing order of
relevance according to the model's predicted scores.

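The reranking step described above can be sketched in a few lines. This is only an illustration, not part of the released code: `rerank` and `toy_overlap_scorer` are hypothetical names, and the toy scorer (word overlap) merely stands in for the model so that the snippet runs without downloading any weights.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and sort passages by decreasing score."""
    pairs = [(query, passage) for passage in passages]
    scores = score_pairs(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


def toy_overlap_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer (assumption): number of whitespace tokens shared by query and passage."""
    return [float(len(set(q.lower().split()) & set(p.lower().split()))) for q, p in pairs]


candidates = [
    "Paris est la capitale de la France.",
    "Le chat dort.",
    "La France est en Europe.",
]
for passage, score in rerank("Quelle est la capitale de la France ?", candidates, toy_overlap_scorer):
    print(f"{score:.1f}\t{passage}")
```

In practice, any callable that maps a list of (query, passage) pairs to one score per pair can replace the toy scorer, for instance the `predict` method of a loaded `CrossEncoder`.
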
## Usage

Here are some examples of using the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [HuggingFace Transformers](#using-huggingface-transformers).

#### Using Sentence-Transformers

Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:

```python
from sentence_transformers import CrossEncoder

# Each pair is (query, passage)
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]

model = CrossEncoder('antoinelouis/crossencoder-mt5-small-mmarcoFR')
scores = model.predict(pairs)
print(scores)
```

#### Using FlagEmbedding

Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this:

```python
from FlagEmbedding import FlagReranker

# Each pair is (query, passage)
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]

reranker = FlagReranker('antoinelouis/crossencoder-mt5-small-mmarcoFR')
scores = reranker.compute_score(pairs)
print(scores)
```

#### Using HuggingFace Transformers

Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Each pair is (query, passage)
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-mt5-small-mmarcoFR')
model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-mt5-small-mmarcoFR')
model.eval()  # inference mode (disables dropout)

with torch.no_grad():  # no gradients are needed for scoring
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    # Flatten the logits to a 1-D tensor of scores
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
    print(scores)
```

***

## Evaluation

The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries. For each query,
a set of 1,000 passages containing the positive(s) and [ColBERTv2 hard negatives](https://huggingface.co/datasets/antoinelouis/msmarco-dev-small-negatives) needs
to be reranked. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k). To see how the model compares to other neural retrievers in French, check out
the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.

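For readers unfamiliar with these metrics, here is a minimal sketch of MRR@k and R@k using their standard definitions. It is an assumption that the evaluation script computes them exactly this way; the function names and the toy query and passage IDs are hypothetical.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], positives: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean over queries of 1/rank of the first relevant passage in the top k (0 if none)."""
    total = 0.0
    for qid, ranked_ids in rankings.items():
        for rank, pid in enumerate(ranked_ids[:k], start=1):
            if pid in positives[qid]:
                total += 1.0 / rank
                break
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]], positives: Dict[str, Set[str]], k: int) -> float:
    """Mean over queries of the fraction of relevant passages found in the top k."""
    total = 0.0
    for qid, ranked_ids in rankings.items():
        hits = len(set(ranked_ids[:k]) & positives[qid])
        total += hits / len(positives[qid])
    return total / len(rankings)


# Toy reranked lists (best first) and relevance judgments
toy_rankings = {"q1": ["p3", "p1", "p2"], "q2": ["p5", "p4", "p6"]}
toy_positives = {"q1": {"p1"}, "q2": {"p6", "p9"}}

print(mrr_at_k(toy_rankings, toy_positives, k=10))
print(recall_at_k(toy_rankings, toy_positives, k=3))
```

The values in the model card metadata are percentages, so a function like these would be multiplied by 100 before reporting.
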
***

## Training

#### Data

We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official dataset but instead sample harder negatives mined from
12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
distillation dataset. In the end, we sample 2.6M training triplets of the form (query, passage, relevance) with a positive-to-negative ratio of 1 (i.e., 50% of the pairs are
relevant and 50% are irrelevant).

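To make the 1:1 positive-to-negative ratio concrete, the sketch below builds balanced (query, passage, relevance) triplets from toy IDs. It is not the actual data pipeline: the function name, the input dictionaries, and the choice to skip queries without mined negatives are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple


def sample_balanced_triplets(
    positives: Dict[str, List[str]],
    hard_negatives: Dict[str, List[str]],
    seed: int = 42,
) -> List[Tuple[str, str, int]]:
    """Emit one (query, negative, 0) triplet for every (query, positive, 1) triplet."""
    rng = random.Random(seed)
    triplets: List[Tuple[str, str, int]] = []
    for qid, pos_ids in positives.items():
        negatives = hard_negatives.get(qid, [])
        if not negatives:
            continue  # skip queries without mined negatives so the ratio stays 1:1
        for pos_id in pos_ids:
            triplets.append((qid, pos_id, 1))
            triplets.append((qid, rng.choice(negatives), 0))
    rng.shuffle(triplets)
    return triplets


# Hypothetical query/passage IDs
toy_pos = {"q1": ["p1"], "q2": ["p2", "p3"], "q3": ["p9"]}
toy_neg = {"q1": ["n1", "n2"], "q2": ["n3"]}
for triplet in sample_balanced_triplets(toy_pos, toy_neg):
    print(triplet)
```
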
#### Implementation

The model is initialized from the [google/mt5-small](https://huggingface.co/google/mt5-small) checkpoint and optimized via the binary cross-entropy loss
(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 80GB NVIDIA H100 GPU for 20k steps using the AdamW optimizer
with a batch size of 128 and a constant learning rate of 2e-5. We set the maximum sequence length of the concatenated question-passage pairs to 256 tokens.
We use the sigmoid function to get scores between 0 and 1.

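As a reminder of what the objective looks like, here is a plain-Python sketch of the sigmoid and of the binary cross-entropy computed from raw logits. It illustrates the formulas only and is not the training code; the function names are hypothetical, and the numerically stable forms used here are a common choice rather than something stated above.

```python
import math
from typing import Sequence


def sigmoid(logit: float) -> float:
    """Map a raw logit to a score in (0, 1), avoiding overflow for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_x = math.exp(logit)
    return exp_x / (1.0 + exp_x)


def binary_cross_entropy(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(x), in a stable form."""
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, labels)
    ]
    return sum(losses) / len(losses)


# A confident correct prediction gives a small loss, a confident wrong one a large loss
print(binary_cross_entropy([10.0, -10.0], [1, 0]))
print(binary_cross_entropy([10.0, -10.0], [0, 1]))
```
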
***

## Citation

```bibtex
@online{louis2024decouvrir,
	author = {Antoine Louis},
	title = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
	publisher = {Hugging Face},
	month = {mar},
	year = {2024},
	url = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}
```