antoinelouis
committed on
Create README.md
README.md
ADDED
@@ -0,0 +1,139 @@
---
pipeline_tag: text-classification
language: fr
license: mit
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- passage-reranking
library_name: sentence-transformers
base_model: antoinelouis/camemberta-L2
model-index:
- name: crossencoder-camemberta-L2-mmarcoFR
  results:
  - task:
      type: text-classification
      name: Passage Reranking
    dataset:
      type: unicamp-dl/mmarco
      name: mMARCO-fr
      config: french
      split: validation
    metrics:
    - type: recall_at_500
      name: Recall@500
      value: 93.06
    - type: recall_at_100
      name: Recall@100
      value: 73.23
    - type: recall_at_10
      name: Recall@10
      value: 40.89
    - type: mrr_at_10
      name: MRR@10
      value: 21.25
---

# crossencoder-camemberta-L2-mmarcoFR

This is a cross-encoder model for French. It performs cross-attention over a question-passage pair and outputs a relevance score.
The model should be used as a reranker for semantic search: given a query and a set of potentially relevant passages retrieved by an efficient first-stage
retrieval system (e.g., BM25 or a fine-tuned dense single-vector bi-encoder), encode each query-passage pair and sort the passages in decreasing order of
relevance according to the model's predicted scores.

## Usage

Here are some examples of using the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [HuggingFace Transformers](#using-huggingface-transformers).

#### Using Sentence-Transformers

Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:

```python
from sentence_transformers import CrossEncoder

# Each pair is (question, candidate passage).
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]

model = CrossEncoder('antoinelouis/crossencoder-camemberta-L2-mmarcoFR')
scores = model.predict(pairs)  # one relevance score per pair
print(scores)
```
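
The reranking step described in the introduction (score every query-passage pair, then sort by decreasing score) can be sketched as follows. This is a minimal illustration, not part of the model's API: `rerank` and `overlap_score` are hypothetical helper names, and `overlap_score` is only a stand-in scorer so that the example runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score each (query, passage) pair and sort passages by decreasing score."""
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


def overlap_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer (assumption, for illustration): number of shared lowercase words."""
    return [float(len(set(q.lower().split()) & set(p.lower().split()))) for q, p in pairs]


if __name__ == "__main__":
    candidates = [
        "Le chat dort.",
        "Paris est la capitale de la France.",
        "La France est en Europe.",
    ]
    for passage, score in rerank("capitale de la France", candidates, overlap_score):
        print(f"{score:.1f}\t{passage}")
```

With the model loaded as in the snippet above, `model.predict` should be usable in place of `overlap_score`, e.g. `rerank(query, passages, model.predict)`.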

#### Using FlagEmbedding

Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this:

```python
from FlagEmbedding import FlagReranker

# Each pair is (question, candidate passage).
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]

reranker = FlagReranker('antoinelouis/crossencoder-camemberta-L2-mmarcoFR')
scores = reranker.compute_score(pairs)  # one relevance score per pair
print(scores)
```

#### Using HuggingFace Transformers

Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Each pair is (question, candidate passage).
pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-camemberta-L2-mmarcoFR')
model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-camemberta-L2-mmarcoFR')
model.eval()

# Inference only: disable gradient tracking.
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    # Flatten the (batch, 1) logits into one raw score per pair.
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
    print(scores)
```

***

## Evaluation

The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries. For each query,
a set of 1,000 passages containing the positive(s) and [ColBERTv2 hard negatives](https://huggingface.co/datasets/antoinelouis/msmarco-dev-small-negatives) needs
to be reranked. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k). To see how the model compares to other neural retrievers in French, check out
the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
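
For readers unfamiliar with these metrics, here is a minimal sketch of MRR@k and R@k using their standard definitions. It is not the evaluation script behind the reported numbers; the function names, the dictionary layout, and the toy passage IDs are assumptions made for illustration.

```python
from typing import Dict, List, Set


def mrr_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean over queries of 1/rank of the first relevant passage in the top k (0 if none)."""
    total = 0.0
    for qid, passage_ids in ranked.items():
        for rank, pid in enumerate(passage_ids[:k], start=1):
            if pid in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(ranked)


def recall_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Mean over queries of the fraction of relevant passages found in the top k."""
    total = 0.0
    for qid, passage_ids in ranked.items():
        hits = len(set(passage_ids[:k]) & relevant[qid])
        total += hits / len(relevant[qid])
    return total / len(ranked)


if __name__ == "__main__":
    # Toy reranked lists (best first) and relevance judgments; IDs are made up.
    ranked = {"q1": ["p3", "p1", "p2"], "q2": ["p5", "p4", "p6"]}
    relevant = {"q1": {"p1"}, "q2": {"p6", "p9"}}
    print(f"MRR@10: {mrr_at_k(ranked, relevant, 10):.4f}")
    print(f"R@3:    {recall_at_k(ranked, relevant, 3):.4f}")
```

Both functions return values in [0, 1]; the figures in the model card metadata are expressed as percentages.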

***

## Training

#### Data

We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official dataset but instead sample harder negatives mined from
12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
distillation dataset. In the end, we sample 2.6M training triplets of the form (query, passage, relevance) with a positive-to-negative ratio of 1 (i.e., 50% of the pairs are
relevant and 50% are irrelevant).
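
To make the 1:1 positive-to-negative ratio concrete, the following sketch builds balanced (query, passage, relevance) triplets from a mapping of positives and a mapping of mined hard negatives. The function name, the input layout, and the choice to sample with replacement are assumptions for illustration; the actual sampling procedure is not detailed here.

```python
import random
from typing import Dict, List, Tuple


def sample_balanced_triplets(
    positives: Dict[str, List[str]],
    hard_negatives: Dict[str, List[str]],
    num_triplets: int,
    seed: int = 42,
) -> List[Tuple[str, str, int]]:
    """Return (query_id, passage_id, relevance) triplets, half relevant and half not."""
    rng = random.Random(seed)
    # Only queries with at least one positive and one mined negative are usable.
    query_ids = sorted(q for q in positives if positives[q] and hard_negatives.get(q))
    triplets: List[Tuple[str, str, int]] = []
    for _ in range(num_triplets // 2):
        qid = rng.choice(query_ids)  # sampled with replacement (assumption)
        triplets.append((qid, rng.choice(positives[qid]), 1))
        triplets.append((qid, rng.choice(hard_negatives[qid]), 0))
    rng.shuffle(triplets)
    return triplets


if __name__ == "__main__":
    # Toy IDs, made up for the example.
    positives = {"q1": ["p1"], "q2": ["p2", "p3"]}
    hard_negatives = {"q1": ["n1", "n2"], "q2": ["n3"]}
    for triplet in sample_balanced_triplets(positives, hard_negatives, num_triplets=6):
        print(triplet)
```

Emitting one positive and one negative per sampled query keeps the two classes exactly balanced for any even `num_triplets`.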

#### Implementation

The model is initialized from the [antoinelouis/camemberta-L2](https://huggingface.co/antoinelouis/camemberta-L2) checkpoint and optimized with the binary cross-entropy loss
(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 80GB NVIDIA H100 GPU for 20k steps using the AdamW optimizer
with a batch size of 128 and a constant learning rate of 2e-5. We set the maximum sequence length of the concatenated question-passage pairs to 256 tokens.
We apply the sigmoid function to the model's output to get scores between 0 and 1.
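
As a rough illustration of the objective and of the score mapping, the sketch below computes a sigmoid and the mean binary cross-entropy over raw logits in plain Python. It is a textbook formulation under stated assumptions, not the training code: the function names are hypothetical, and a real training loop would use a tensor library instead.

```python
import math
from typing import Sequence


def sigmoid(logit: float) -> float:
    """Map a raw logit to a score in (0, 1), computed stably for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def bce_with_logits(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy between sigmoid(logit) and the 0/1 relevance label."""
    total = 0.0
    for x, y in zip(logits, labels):
        # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(x).
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


if __name__ == "__main__":
    logits = [2.0, -1.5, 0.0]  # made-up raw model outputs
    labels = [1, 0, 1]         # 1 = relevant, 0 = irrelevant
    print([round(sigmoid(x), 4) for x in logits])
    print(round(bce_with_logits(logits, labels), 4))
```

The loss is low when relevant pairs receive large positive logits and irrelevant pairs large negative ones, which is what makes the sigmoid output usable as a relevance score.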

***

## Citation

```bibtex
@online{louis2024decouvrir,
    author    = {Antoine Louis},
    title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
    publisher = {Hugging Face},
    month     = mar,
    year      = {2024},
    url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}
```