antoinelouis committed commit 78311e3 (parent: 02acde6)

Update README.md

Files changed (1): README.md (+48 -24)

README.md CHANGED
@@ -10,6 +10,36 @@ tags:
 - passage-reranking
 library_name: sentence-transformers
 base_model: dbmdz/electra-base-french-europeana-cased-discriminator
 ---
 
 # crossencoder-electra-base-french-mmarcoFR
@@ -75,17 +105,10 @@ print(scores)
 
 ## Evaluation
 
-We evaluate the model on 500 random training queries from [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/) (which were excluded from training) by reranking
-subsets of candidate passages comprising at least one relevant and up to 200 BM25 negative passages for each query. Below, we compare the model performance with other
-cross-encoder models fine-tuned on the same dataset. We report the R-precision (RP), mean reciprocal rank (MRR), and recall at various cut-offs (R@k).
-
-|    | model | Vocab. | #Param. | Size | RP | MRR@10 | R@10(↑) | R@20 | R@50 | R@100 |
-|---:|:------|:-------|--------:|-----:|---:|-------:|--------:|-----:|-----:|------:|
-| 1 | [crossencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | fr | 110M | 443MB | 35.65 | 50.44 | 82.95 | 91.50 | 96.80 | 98.80 |
-| 2 | [crossencoder-mMiniLMv2-L12-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | fr,99+ | 118M | 471MB | 34.37 | 51.01 | 82.23 | 90.60 | 96.45 | 98.40 |
-| 3 | [crossencoder-distilcamembert-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR) | fr | 68M | 272MB | 27.28 | 43.71 | 80.30 | 89.10 | 95.55 | 98.60 |
-| 4 | **crossencoder-electra-base-french-mmarcoFR** | fr | 110M | 443MB | 28.32 | 45.28 | 79.22 | 87.15 | 93.15 | 95.75 |
-| 5 | [crossencoder-mMiniLMv2-L6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-mmarcoFR) | fr,99+ | 107M | 428MB | 33.92 | 49.33 | 79.00 | 88.35 | 94.80 | 98.20 |
 
 ***
 
@@ -94,28 +117,29 @@ cross-encoder models fine-tuned on the same dataset. We report the R-precision (
 #### Data
 
 We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
-that contains 8.8M passages and 539K training queries. We sample 1M question-passage pairs from the official ~39.8M
-[training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset) with a positive-to-negative ratio of 4 (i.e., 25% of the pairs are
-relevant and 75% are irrelevant).
 
 #### Implementation
 
 The model is initialized from the [dbmdz/electra-base-french-europeana-cased-discriminator](https://huggingface.co/dbmdz/electra-base-french-europeana-cased-discriminator) checkpoint and optimized via the binary cross-entropy loss
-(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 10 epochs (i.e., 312.4k steps) using the AdamW optimizer
-with a batch size of 32, a peak learning rate of 2e-5 with warm-up along the first 500 steps and linear scheduling. We set the maximum sequence length of the
-concatenated question-passage pairs to 512 tokens. We use the sigmoid function to get scores between 0 and 1.
 
 ***
 
 ## Citation
 
 ```bibtex
-@online{louis2023,
-  author = 'Antoine Louis',
-  title = 'crossencoder-electra-base-french-mmarcoFR: A Cross-Encoder Model Trained on 1M sentence pairs in French',
-  publisher = 'Hugging Face',
-  month = 'september',
-  year = '2023',
-  url = 'https://huggingface.co/antoinelouis/crossencoder-electra-base-french-mmarcoFR',
 }
 ```
 
 - passage-reranking
 library_name: sentence-transformers
 base_model: dbmdz/electra-base-french-europeana-cased-discriminator
+model-index:
+- name: crossencoder-electra-base-french-mmarcoFR
+  results:
+  - task:
+      type: text-classification
+      name: Passage Reranking
+    dataset:
+      type: unicamp-dl/mmarco
+      name: mMARCO-fr
+      config: french
+      split: validation
+    metrics:
+    - type: recall_at_500
+      name: Recall@500
+      value: 0.0
+    - type: recall_at_100
+      name: Recall@100
+      value: 0.0
+    - type: recall_at_10
+      name: Recall@10
+      value: 0.0
+    - type: map_at_10
+      name: MAP@10
+      value: 0.0
+    - type: ndcg_at_10
+      name: nDCG@10
+      value: 0.0
+    - type: mrr_at_10
+      name: MRR@10
+      value: 0.0
 ---
 
 # crossencoder-electra-base-french-mmarcoFR
 
 ## Evaluation
 
+The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for which
+a set of 1000 candidate passages, comprising the positive(s) and [ColBERTv2 hard negatives](https://huggingface.co/datasets/antoinelouis/msmarco-dev-small-negatives),
+must be reranked. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (nDCG), mean average precision (MAP), and recall at various cut-offs
+(R@k). To see how it compares to other neural retrievers in French, check out the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
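As a side note for readers, the cut-off metrics above can be computed from reranked candidate lists as sketched below. This is an illustrative, self-contained snippet (not code from the repository), and the toy `ranked`/`relevant` data is invented for the example.

```python
# Illustrative only: MRR@k and Recall@k over per-query reranked candidate lists.
def mrr_at_k(ranked, relevant, k=10):
    """Mean reciprocal rank: mean of 1/rank of the first relevant hit in the top k."""
    total = 0.0
    for docs, rels in zip(ranked, relevant):
        for rank, doc in enumerate(docs[:k], start=1):
            if doc in rels:
                total += 1.0 / rank
                break
    return total / len(ranked)

def recall_at_k(ranked, relevant, k):
    """Mean fraction of each query's relevant passages found in the top k."""
    return sum(len(rels & set(docs[:k])) / len(rels)
               for docs, rels in zip(ranked, relevant)) / len(ranked)

# Toy data: two queries whose candidates have already been reranked best-first.
ranked = [["p3", "p1", "p7"], ["p9", "p2", "p5"]]
relevant = [{"p1"}, {"p9", "p5"}]
print(mrr_at_k(ranked, relevant))        # 0.75  (= (1/2 + 1/1) / 2)
print(recall_at_k(ranked, relevant, 2))  # 0.75  (= (1/1 + 1/2) / 2)
```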
 
 ***
 
 
 #### Data
 
 We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
+that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official dataset but instead sample harder negatives mined by
+12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
+distillation dataset. In total, we sample 2.6M training triplets of the form (query, passage, relevance) with a positive-to-negative ratio of 1 (i.e., 50% of the pairs are
+relevant and 50% are irrelevant).
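To make the 1:1 sampling above concrete, here is a hypothetical sketch: each query contributes one labeled positive and one labeled hard negative. The lookup tables and toy inputs are invented for illustration and are not the actual data files used for this model.

```python
import random

def build_triplets(queries, positives, hard_negatives, seed=42):
    """Illustrative 1:1 sampling: one (query, passage, 1) and one (query, passage, 0) per query."""
    rng = random.Random(seed)
    triplets = []
    for qid, query in queries.items():
        triplets.append((query, rng.choice(positives[qid]), 1))       # relevant passage
        triplets.append((query, rng.choice(hard_negatives[qid]), 0))  # mined hard negative
    rng.shuffle(triplets)
    return triplets

# Hypothetical toy inputs.
queries = {"q1": "Qui a inventé le cinématographe ?"}
positives = {"q1": ["Les frères Lumière ont inventé le cinématographe."]}
hard_negatives = {"q1": ["Le cinéma muet a dominé jusqu'en 1927.",
                         "Thomas Edison a inventé le phonographe."]}

triplets = build_triplets(queries, positives, hard_negatives)
print(sum(label for _, _, label in triplets) / len(triplets))  # 0.5 -> half relevant
```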
 
 #### Implementation
 
 The model is initialized from the [dbmdz/electra-base-french-europeana-cased-discriminator](https://huggingface.co/dbmdz/electra-base-french-europeana-cased-discriminator) checkpoint and optimized via the binary cross-entropy loss
+(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 80GB NVIDIA H100 GPU for 20k steps using the AdamW optimizer
+with a batch size of 128 and a constant learning rate of 2e-5. We set the maximum sequence length of the concatenated question-passage pairs to 256 tokens.
+We use the sigmoid function to get scores between 0 and 1.
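The objective above can be sketched with the standard library alone. This is a sketch of the math, not the actual training script: the logits below are invented stand-ins for the cross-encoder's raw outputs on one relevant and one irrelevant (query, passage) pair.

```python
import math

def sigmoid(x):
    """Map a raw logit to a relevance score in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy: max(x,0) - x*y + log(1 + e^-|x|)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# Invented logits: one relevant (label 1.0) and one irrelevant (label 0.0) pair.
pairs = [(2.3, 1.0), (-1.7, 0.0)]
loss = sum(bce_with_logits(x, y) for x, y in pairs) / len(pairs)
scores = [round(sigmoid(x), 3) for x, _ in pairs]
print(scores)  # [0.909, 0.154] -> confidently-separated pairs give a low loss
```

In training, the same loss is applied to the model's single output logit per pair; at inference time only the sigmoid is needed to produce the final score.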
 
 ***
 
 ## Citation
 
 ```bibtex
+@online{louis2024decouvrir,
+  author = 'Antoine Louis',
+  title = 'DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French',
+  publisher = 'Hugging Face',
+  month = 'mar',
+  year = '2024',
+  url = 'https://huggingface.co/spaces/antoinelouis/decouvrir',
 }
 ```