antoinelouis committed
Commit 49268d7
1 Parent(s): 45763c7

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,123 @@
---
pipeline_tag: sentence-similarity
language: fr
license: apache-2.0
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- sentence-similarity
library_name: sentence-transformers
---
# crossencoder-camembert-base-mmarcoFR

This is a [sentence-transformers](https://www.SBERT.net) model trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.

It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1. The model can be used for tasks like reranking and [semantic search](https://www.sbert.net/examples/applications/retrieve_rerank/README.html): given a query, score it against a set of candidate passages -- e.g., retrieved with BM25 or a bi-encoder -- then sort the passages in decreasing order of relevance according to the model's predictions.

## Usage
***

#### Sentence-Transformers

Using this model is straightforward when you have [sentence-transformers](https://www.SBERT.net) installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('antoinelouis/crossencoder-camembert-base-mmarcoFR')
pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')]

scores = model.predict(pairs)
print(scores)
```

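For the retrieve-and-rerank use case described above, the predicted scores can be used directly to reorder candidate passages. Below is a minimal sketch with a made-up query and candidates; in practice, the candidates would come from a first-stage retriever such as BM25 or a bi-encoder:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('antoinelouis/crossencoder-camembert-base-mmarcoFR')

# Hypothetical query and candidate passages (e.g., the top-k results of a BM25 run).
query = "Quelle est la capitale de la France ?"
candidates = [
    "Paris est la capitale et la ville la plus peuplée de la France.",
    "Le Mont Blanc est le plus haut sommet des Alpes.",
    "La Seine traverse Paris avant de se jeter dans la Manche.",
]

# Score each (query, passage) pair, then sort the passages by decreasing relevance.
scores = model.predict([(query, passage) for passage in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for passage, score in reranked:
    print(f"{score:.4f}\t{passage}")
```
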
#### 🤗 Transformers

Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-camembert-base-mmarcoFR')
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-camembert-base-mmarcoFR')

queries = ['Query', 'Query', 'Query']
passages = ['Paragraph1', 'Paragraph2', 'Paragraph3']
features = tokenizer(queries, passages, padding=True, truncation=True, return_tensors='pt')

model.eval()
with torch.no_grad():
    # Apply a sigmoid to map the raw logits to relevance scores between 0 and 1,
    # as the sentence-transformers CrossEncoder does for single-label models.
    scores = torch.sigmoid(model(**features).logits)
print(scores)
```

## Evaluation
***

We evaluated our model on 500 random queries from the mMARCO-fr train set (which were excluded from training). Each of these queries has at least one relevant passage and up to 200 irrelevant passages.

| r-precision | mrr@10 | recall@10 | recall@20 | recall@50 | recall@100 |
|------------:|-------:|----------:|----------:|----------:|-----------:|
|       35.65 |  50.44 |     82.95 |      91.5 |      96.8 |       98.8 |

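For reference, these metrics can be computed from the ranked list that the cross-encoder produces for each query and then averaged over the 500 queries. Here is a minimal, illustrative sketch with toy passage IDs and relevance judgments (not the exact evaluation script):

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    # Reciprocal rank of the first relevant passage within the top k, else 0.
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    # Fraction of the relevant passages retrieved in the top k.
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def r_precision(ranked_ids, relevant_ids):
    # Precision at rank R, where R is the number of relevant passages
    # (each query has at least one relevant passage).
    r = len(relevant_ids)
    return len(set(ranked_ids[:r]) & relevant_ids) / r

# Toy example: passage IDs ranked by the cross-encoder for one query.
ranked = [12, 7, 3, 42, 8]
relevant = {7, 42}
print(mrr_at_k(ranked, relevant), recall_at_k(ranked, relevant, k=5), r_precision(ranked, relevant))
```
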
Below, we compare its results with those of other cross-encoder models fine-tuned on the same dataset:

|    | model | r-precision | mrr@10 | recall@10 (↑) | recall@20 | recall@50 | recall@100 |
|---:|:------|------------:|-------:|--------------:|----------:|----------:|-----------:|
|  1 | **crossencoder-camembert-base-mmarcoFR** | 35.65 | 50.44 | 82.95 | 91.5 | 96.8 | 98.8 |
|  2 | [crossencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR) | 34.37 | 51.01 | 82.23 | 90.6 | 96.45 | 98.4 |
|  3 | [crossencoder-mmarcoFR-mMiniLMv2-L12-H384-v1-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mmarcoFR-mMiniLMv2-L12-H384-v1-mmarcoFR) | 34.22 | 49.2 | 81.7 | 90.9 | 97.1 | 98.9 |
|  4 | [crossencoder-mpnet-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mpnet-base-mmarcoFR) | 29.68 | 46.13 | 80.45 | 87.9 | 93.15 | 96.6 |
|  5 | [crossencoder-distilcamembert-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-base-mmarcoFR) | 27.28 | 43.71 | 80.3 | 89.1 | 95.55 | 98.6 |
|  6 | [crossencoder-roberta-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-roberta-base-mmarcoFR) | 33.33 | 48.87 | 79.33 | 86.75 | 94.15 | 97.6 |
|  7 | [crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR) | 28.32 | 45.28 | 79.22 | 87.15 | 93.15 | 95.75 |
|  8 | [crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR) | 33.92 | 49.33 | 79 | 88.35 | 94.8 | 98.2 |
|  9 | [crossencoder-msmarco-electra-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-electra-base-mmarcoFR) | 25.52 | 42.46 | 78.73 | 88.85 | 96.55 | 98.85 |
| 10 | [crossencoder-bert-base-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-bert-base-uncased-mmarcoFR) | 30.48 | 45.79 | 78.35 | 89.45 | 94.15 | 97.45 |
| 11 | [crossencoder-msmarco-MiniLM-L-12-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-12-v2-mmarcoFR) | 29.07 | 44.41 | 77.83 | 88.1 | 95.55 | 99 |
| 12 | [crossencoder-msmarco-MiniLM-L-6-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-6-v2-mmarcoFR) | 32.92 | 47.56 | 77.27 | 88.15 | 94.85 | 98.15 |
| 13 | [crossencoder-msmarco-MiniLM-L-4-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-4-v2-mmarcoFR) | 30.98 | 46.22 | 76.35 | 85.8 | 94.35 | 97.55 |
| 14 | [crossencoder-MiniLM-L6-H384-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L6-H384-uncased-mmarcoFR) | 29.23 | 45.12 | 76.08 | 83.7 | 92.65 | 97.45 |
| 15 | [crossencoder-electra-base-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-discriminator-mmarcoFR) | 28.48 | 43.58 | 75.63 | 86.15 | 93.25 | 96.6 |
| 16 | [crossencoder-electra-small-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-small-discriminator-mmarcoFR) | 31.83 | 45.97 | 75.13 | 84.95 | 94.55 | 98.15 |
| 17 | [crossencoder-distilroberta-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilroberta-base-mmarcoFR) | 28.22 | 42.85 | 74.13 | 84.08 | 94.2 | 98.5 |
| 18 | [crossencoder-msmarco-TinyBERT-L-6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-6-mmarcoFR) | 28.23 | 42.7 | 73.63 | 85.65 | 92.65 | 98.35 |
| 19 | [crossencoder-msmarco-TinyBERT-L-4-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-4-mmarcoFR) | 28.6 | 43.19 | 72.17 | 81.95 | 92.8 | 97.4 |
| 20 | [crossencoder-msmarco-MiniLM-L-2-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-2-v2-mmarcoFR) | 30.82 | 44.3 | 72.03 | 82.65 | 93.35 | 98.1 |
| 21 | [crossencoder-distilbert-base-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilbert-base-uncased-mmarcoFR) | 25.47 | 40.11 | 71.37 | 85.6 | 93.85 | 97.95 |
| 22 | [crossencoder-msmarco-TinyBERT-L-2-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-2-v2-mmarcoFR) | 31.08 | 43.88 | 71.3 | 81.43 | 92.6 | 98.1 |

## Training
***

#### Background

We used the [camembert-base](https://huggingface.co/camembert-base) model and fine-tuned it with a binary cross-entropy loss function on 1M question-passage pairs in French with a positive-to-negative ratio of 4 (i.e., 25% of the pairs are relevant and 75% are irrelevant).

#### Hyperparameters

We trained the model on a single Tesla V100 GPU with 32GB of memory for 10 epochs (i.e., 312.4k steps) using a batch size of 32. We used the AdamW optimizer with an initial learning rate of 2e-05, weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate thereafter. The sequence length was limited to 512 tokens.

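As an illustration, a comparable run can be set up with the sentence-transformers `CrossEncoder` trainer, which defaults to a binary cross-entropy loss when `num_labels=1`. The sketch below wires in the hyperparameters listed above with made-up training pairs; it is not the exact training script, and the API may differ across sentence-transformers versions:

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Hypothetical labeled pairs: label 1.0 for relevant (query, passage) pairs, 0.0 otherwise.
train_samples = [
    InputExample(texts=["Quelle est la capitale de la France ?", "Paris est la capitale de la France."], label=1.0),
    InputExample(texts=["Quelle est la capitale de la France ?", "Le Mont Blanc culmine à 4808 mètres."], label=0.0),
]

model = CrossEncoder('camembert-base', num_labels=1, max_length=512)
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)

# With num_labels=1, CrossEncoder.fit uses BCEWithLogitsLoss and a WarmupLinear schedule by default.
model.fit(
    train_dataloader=train_dataloader,
    epochs=10,
    warmup_steps=500,
    optimizer_params={'lr': 2e-5},
    weight_decay=0.01,
    output_path='crossencoder-camembert-base-mmarcoFR',
)
```
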
#### Data

We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multi-lingual machine-translated version of the MS MARCO dataset, a popular large-scale IR dataset.

## Citation
***

```bibtex
@online{louis2023,
  author    = {Antoine Louis},
  title     = {crossencoder-camembert-base-mmarcoFR: A Cross-Encoder Model Trained on 1M Sentence Pairs in French},
  publisher = {Hugging Face},
  month     = sep,
  year      = {2023},
  url       = {https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR},
}
```
config.json ADDED
@@ -0,0 +1,34 @@
{
  "_name_or_path": "camembert-base",
  "architectures": [
    "CamembertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 5,
  "classifier_dropout": null,
  "eos_token_id": 6,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "camembert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.28.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 32005
}
dev_scores.csv ADDED
@@ -0,0 +1,2 @@
r-precision,mrr@10,recall@10,recall@20,recall@50,recall@100,model
35.65,50.44,82.95,91.50,96.80,98.80,crossencoder-camembert-base-mmarcoFR
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:324dc3236fb67814bab4d2faccf7293865fa64333c0347a285e812d913c70f34
size 442564277
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:988bc5a00281c6d210a5d34bd143d0363741a432fefe741bf71e61b1869d4314
size 810912
special_tokens_map.json ADDED
@@ -0,0 +1,19 @@
{
  "additional_special_tokens": [
    "<s>NOTUSED",
    "</s>NOTUSED"
  ],
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,23 @@
{
  "additional_special_tokens": [
    "<s>NOTUSED",
    "</s>NOTUSED"
  ],
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": {
    "__type": "AddedToken",
    "content": "<mask>",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "CamembertTokenizer",
  "unk_token": "<unk>"
}