LucasF dangvantuan commited on
Commit
aa8d8fe
0 Parent(s):

Duplicate from Lajavaness/sentence-camembert-base

Browse files

Co-authored-by: Van Tuan DANG <dangvantuan@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ {
2
+ "word_embedding_dimension": 768,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": true,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false
7
+ }
README.md ADDED
@@ -0,0 +1,141 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ pipeline_tag: sentence-similarity
3
+ language: fr
4
+ datasets:
5
+ - stsb_multi_mt
6
+ tags:
7
+ - Text
8
+ - Sentence Similarity
9
+ - Sentence-Embedding
10
+ - camembert-base
11
+ license: apache-2.0
12
+ model-index:
13
+ - name: sentence-camembert-base by Van Tuan DANG
14
+ results:
15
+ - task:
16
+ name: Sentence-Embedding
17
+ type: Text Similarity
18
+ dataset:
19
+ name: Text Similarity fr
20
+ type: stsb_multi_mt
21
+ args: fr
22
+ metrics:
23
+ - name: Test Pearson correlation coefficient
24
+ type: Pearson_correlation_coefficient
25
+ value: 86.88
26
+ library_name: sentence-transformers
27
+ ---
28
+
29
+ ## Pre-trained sentence embedding models are the state-of-the-art of Sentence Embeddings for French.
30
+ This model is improved from [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) using fine-tuning with [Augmented SBERT](https://aclanthology.org/2021.naacl-main.28.pdf) on dataset [stsb](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) along with Pair Sampling Strategies through 2 models [CrossEncoder-camembert-large](https://huggingface.co/dangvantuan/CrossEncoder-camembert-large) and [dangvantuan/sentence-camembert-large](https://huggingface.co/dangvantuan/sentence-camembert-large)
31
+ ## Usage
32
+ The model can be used directly (without a language model) as follows:
33
+
34
+ ```python
35
+ from sentence_transformers import SentenceTransformer
36
+ model = SentenceTransformer("Lajavaness/sentence-camembert-base")
37
+
38
+ sentences = ["Un avion est en train de décoller.",
39
+ "Un homme joue d'une grande flûte.",
40
+ "Un homme étale du fromage râpé sur une pizza.",
41
+ "Une personne jette un chat au plafond.",
42
+ "Une personne est en train de plier un morceau de papier.",
43
+ ]
44
+
45
+ embeddings = model.encode(sentences)
46
+ ```
47
+
48
+ ## Evaluation
49
+ The model can be evaluated as follows on the French test data of stsb.
50
+
51
+ ```python
52
+ from sentence_transformers import SentenceTransformer
53
+ from sentence_transformers.readers import InputExample
54
+ from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
55
+ from datasets import load_dataset
56
+ def convert_dataset(dataset):
57
+ dataset_samples=[]
58
+ for df in dataset:
59
+ score = float(df['similarity_score'])/5.0 # Normalize score to range 0 ... 1
60
+ inp_example = InputExample(texts=[df['sentence1'],
61
+ df['sentence2']], label=score)
62
+ dataset_samples.append(inp_example)
63
+ return dataset_samples
64
+
65
+ # Loading the dataset for evaluation
66
+ df_dev = load_dataset("stsb_multi_mt", name="fr", split="dev")
67
+ df_test = load_dataset("stsb_multi_mt", name="fr", split="test")
68
+
69
+ # Convert the dataset for evaluation
70
+
71
+ # For Dev set:
72
+ dev_samples = convert_dataset(df_dev)
73
+ val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')
74
+ val_evaluator(model, output_path="./")
75
+
76
+ # For Test set:
77
+ test_samples = convert_dataset(df_test)
78
+ test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
79
+ test_evaluator(model, output_path="./")
80
+ ```
81
+
82
+ **Test Result**:
83
+ The performance is measured using Pearson and Spearman correlation on the sts-benchmark:
84
+ - On dev
85
+
86
+
87
+ | Model | Pearson correlation | Spearman correlation | #params |
88
+ | ------------- | ------------- | ------------- |------------- |
89
+ | [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base)| 86.88 |86.73 | 110M |
90
+ | [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base)| 86.73 |86.54 | 110M |
91
+ [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts)| 85.85 |85.71 | 137M |
92
+ | [distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 79.22 | 79.16|135M |
93
+
94
+
95
+ - On test: Pearson and Spearman correlation are evaluated on many different benchmarks dataset:
96
+
97
+ **Pearson score**
98
+ | Model | [STS-B](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) | [STS12-fr ](https://huggingface.co/datasets/Lajavaness/STS12-fr)| [STS13-fr](https://huggingface.co/datasets/Lajavaness/STS13-fr) | [STS14-fr](https://huggingface.co/datasets/Lajavaness/STS14-fr) | [STS15-fr](https://huggingface.co/datasets/Lajavaness/STS15-fr) | [STS16-fr](https://huggingface.co/datasets/Lajavaness/STS16-fr) | [SICK-fr](https://huggingface.co/datasets/Lajavaness/SICK-fr) | params |
99
+ |-----------------------------------------------------------|---------|----------|----------|----------|----------|----------|---------|--------|
100
+ | [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) | 83.46 | 84.49 | 84.61 | 83.94 | 86.94 | 75.20 | 82.86 | 110M |
101
+ | [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 82.82 | 84.79 | 85.76 | 82.81 | 85.38 | 74.05 | 82.23 | 137M |
102
+ | [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 82.36 | 82.06 | 84.08 | 81.51 | 85.54 | 73.97 | 80.91 | 110M |
103
+ | [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased)| 78.63 | 72.51 | 67.25 | 70.12 | 79.93 | 66.67 | 77.76 | 135M |
104
+ | [hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts) | 78.38 | 79.00 | 77.61 | 76.56 | 79.03 | 71.22 | 80.58 | 137M |
105
+ | [antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 76.97 | 71.43 | 73.50 | 70.56 | 78.44 | 71.23 | 77.62 | 110M |
106
+
107
+
108
+ **Spearman score**
109
+ | Model | [STS-B](https://huggingface.co/datasets/stsb_multi_mt/viewer/fr/train) | [STS12-fr ](https://huggingface.co/datasets/Lajavaness/STS12-fr)| [STS13-fr](https://huggingface.co/datasets/Lajavaness/STS13-fr) | [STS14-fr](https://huggingface.co/datasets/Lajavaness/STS14-fr) | [STS15-fr](https://huggingface.co/datasets/Lajavaness/STS15-fr) | [STS16-fr](https://huggingface.co/datasets/Lajavaness/STS16-fr) | [SICK-fr](https://huggingface.co/datasets/Lajavaness/SICK-fr) | params |
110
+ |-----------------------------------------------------------|---------|----------|----------|----------|----------|----------|---------|--------|
111
+ | [Lajavaness/sentence-camembert-base](https://huggingface.co/Lajavaness/sentence-camembert-base) | 82.92 | 77.71 | 84.19 | 81.83 | 87.04 | 76.81 | 76.36 | 110M |
112
+ | [inokufu/flaubert-base-uncased-xnli-sts](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 83.07 | 77.34 | 85.88 | 80.96 | 85.70 | 76.43 | 77.00 | 137M |
113
+ | [dangvantuan/sentence-camembert-base](https://huggingface.co/dangvantuan/sentence-camembert-base) | 81.64 | 75.45 | 83.86 | 78.63 | 85.66 | 75.36 | 74.18 | 110M |
114
+ | [sentence-transformers/distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 77.49 | 69.80 | 68.85 | 68.17 | 80.27 | 70.04 | 72.49 | 135M |
115
+ | [hugorosen/flaubert_base_uncased-xnli-sts](https://huggingface.co/hugorosen/flaubert_base_uncased-xnli-sts) | 76.93 | 68.96 | 77.62 | 71.87 | 79.33 | 72.86 | 73.91 | 137M |
116
+ | [antoinelouis/biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) | 75.55 | 66.89 | 73.90 | 67.14 | 78.78 | 72.64 | 72.03 | 110M |
117
+
118
+
119
+ ## Citation
120
+
121
+
122
+ @article{reimers2019sentence,
123
+ title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
124
+ author={Nils Reimers, Iryna Gurevych},
125
+ journal={https://arxiv.org/abs/1908.10084},
126
+ year={2019}
127
+ }
128
+
129
+
130
+ @article{martin2020camembert,
131
+ title={CamemBERT: a Tasty French Language Mode},
132
+ author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
133
+ journal={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
134
+ year={2020}
135
+ }
136
+ @article{thakur2020augmented,
137
+ title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
138
+ author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
139
+ journal={arXiv e-prints},
140
+ pages={arXiv--2010},
141
+ year={2020}
added_tokens.json ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "</s>": 6,
3
+ "</s>NOTUSED": 2,
4
+ "<mask>": 32004,
5
+ "<pad>": 1,
6
+ "<s>": 5,
7
+ "<s>NOTUSED": 0,
8
+ "<unk>": 3
9
+ }
config.json ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_name_or_path": "/home/van.tuan@ljnad.lajavaness.com/.cache/torch/sentence_transformers/dangvantuan_sentence-camembert-base/",
3
+ "architectures": [
4
+ "CamembertModel"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "bos_token_id": 0,
8
+ "classifier_dropout": null,
9
+ "eos_token_id": 2,
10
+ "eos_token_ids": 0,
11
+ "hidden_act": "gelu",
12
+ "hidden_dropout_prob": 0.1,
13
+ "hidden_size": 768,
14
+ "initializer_range": 0.02,
15
+ "intermediate_size": 3072,
16
+ "layer_norm_eps": 1e-05,
17
+ "max_position_embeddings": 514,
18
+ "model_type": "camembert",
19
+ "num_attention_heads": 12,
20
+ "num_hidden_layers": 12,
21
+ "output_past": true,
22
+ "pad_token_id": 0,
23
+ "position_embedding_type": "absolute",
24
+ "torch_dtype": "float32",
25
+ "transformers_version": "4.34.0",
26
+ "type_vocab_size": 1,
27
+ "use_cache": true,
28
+ "vocab_size": 32005
29
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "2.1.0",
4
+ "transformers": "4.11.3",
5
+ "pytorch": "1.9.1+cu102"
6
+ }
7
+ }
modules.json ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ }
14
+ ]
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4393450effe32ccc644bc82e5be67eb15cf2a8183a315633bcbabb73a47a46ea
3
+ size 442554537
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ {
2
+ "max_seq_length": 128,
3
+ "do_lower_case": false
4
+ }
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:988bc5a00281c6d210a5d34bd143d0363741a432fefe741bf71e61b1869d4314
3
+ size 810912
special_tokens_map.json ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "additional_special_tokens": [
3
+ "<s>NOTUSED",
4
+ "</s>NOTUSED"
5
+ ],
6
+ "bos_token": "<s>",
7
+ "cls_token": "<s>",
8
+ "eos_token": "</s>",
9
+ "mask_token": "<mask>",
10
+ "pad_token": "<pad>",
11
+ "sep_token": "</s>",
12
+ "unk_token": "<unk>"
13
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,82 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<s>NOTUSED",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<pad>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "</s>NOTUSED",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "<unk>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "5": {
36
+ "content": "<s>",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "6": {
44
+ "content": "</s>",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "32004": {
52
+ "content": "<mask>",
53
+ "lstrip": true,
54
+ "normalized": true,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ }
59
+ },
60
+ "additional_special_tokens": [
61
+ "<s>NOTUSED",
62
+ "</s>NOTUSED"
63
+ ],
64
+ "bos_token": "<s>",
65
+ "clean_up_tokenization_spaces": true,
66
+ "cls_token": "<s>",
67
+ "eos_token": "</s>",
68
+ "mask_token": "<mask>",
69
+ "max_length": 128,
70
+ "model_max_length": 1000000000000000019884624838656,
71
+ "pad_to_multiple_of": null,
72
+ "pad_token": "<pad>",
73
+ "pad_token_type_id": 0,
74
+ "padding_side": "right",
75
+ "sep_token": "</s>",
76
+ "sp_model_kwargs": {},
77
+ "stride": 0,
78
+ "tokenizer_class": "CamembertTokenizer",
79
+ "truncation_side": "right",
80
+ "truncation_strategy": "longest_first",
81
+ "unk_token": "<unk>"
82
+ }