hakim committed on
Commit 8ab0018 · 1 Parent(s): 60ca23e

update readme

Files changed (1):
  1. README.md +88 -49
README.md CHANGED
@@ -2,7 +2,7 @@
 language: fr
 license: mit
 library_name: sentence-transformers
-pipeline_tag: sentence-similarity
 tags:
 - sentence-transformers
 - feature-extraction
@@ -24,30 +24,31 @@ model-index:
 type: stsb_multi_mt
 args: fr
 metrics:
-- name: Pearson correlation coefficient
 type: pearsonr
-value: 83.7
 ---

-# h4c5/sts-camembert-base

-This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

-<!--- Describe your model here -->

-## Usage (Sentence-Transformers)

-Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

 ```
 pip install -U sentence-transformers
 ```

-Then you can use the model like this:
-
 ```python
 from sentence_transformers import SentenceTransformer
-sentences = ["This is an example sentence", "Each sentence is converted"]

 model = SentenceTransformer('h4c5/sts-camembert-base')
 embeddings = model.encode(sentences)
@@ -55,50 +56,85 @@ print(embeddings)
 ```


-## Usage (HuggingFace Transformers)
-Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings.

 ```python
 from transformers import AutoTokenizer, AutoModel
 import torch

-#Mean Pooling - Take attention mask into account for correct averaging
 def mean_pooling(model_output, attention_mask):
-    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
-    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

-# Sentences we want sentence embeddings for
-sentences = ['This is an example sentence', 'Each sentence is converted']

-# Load model from HuggingFace Hub
-tokenizer = AutoTokenizer.from_pretrained('h4c5/sts-camembert-base')
-model = AutoModel.from_pretrained('h4c5/sts-camembert-base')

-# Tokenize sentences
-encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

-# Compute token embeddings
-with torch.no_grad():
-    model_output = model(**encoded_input)

-# Perform pooling. In this case, mean pooling.
-sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

-print("Sentence embeddings:")
-print(sentence_embeddings)
-```

-## Evaluation Results

-<!--- Describe how your model was evaluated -->

-For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=h4c5/sts-camembert-base)
  ## Training
@@ -115,7 +151,7 @@ The model was trained with the parameters:

 `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`

-Parameters of the fit()-Method:
 ```
 {
     "epochs": 10,
@@ -135,6 +171,7 @@ Parameters of the fit()-Method:


 ## Full Model Architecture
 ```
 SentenceTransformer(
   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel
@@ -144,16 +181,18 @@ SentenceTransformer(

 ## Citing

-@article{reimers2019sentence,
-    title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
-    author={Nils Reimers, Iryna Gurevych},
-    journal={https://arxiv.org/abs/1908.10084},
-    year={2019}
-}
-
-@inproceedings{martin2020camembert,
-    title={CamemBERT: a Tasty French Language Model},
-    author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
-    booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
-    year={2020}
-}
 
 
 
@@ -2,7 +2,7 @@
 language: fr
 license: mit
 library_name: sentence-transformers
+pipeline_tag: feature-extraction
 tags:
 - sentence-transformers
 - feature-extraction
 
@@ -24,30 +24,31 @@ model-index:
 type: stsb_multi_mt
 args: fr
 metrics:
+- name: Pearson Correlation - stsb_multi_mt fr
 type: pearsonr
+value: 0.837
 ---

+## Description

+This [sentence-transformers](https://www.SBERT.net) model was obtained by fine-tuning
+[`almanach/camembert-base`](https://huggingface.co/almanach/camembert-base) with the
+[sentence-transformers](https://www.SBERT.net) library.

+It encodes a sentence or a paragraph (at most 512 tokens, the model's `max_seq_length`) into a
+768-dimensional dense vector.

+The [CamemBERT](https://arxiv.org/abs/1911.03894) model it is based on is a RoBERTa-style model
+that is state of the art for French.

+## Usage with the `sentence-transformers` library

 ```
 pip install -U sentence-transformers
 ```

 ```python
 from sentence_transformers import SentenceTransformer

+sentences = ["Ceci est un exemple", "deuxième exemple"]

 model = SentenceTransformer('h4c5/sts-camembert-base')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
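
Since the model is trained for semantic textual similarity, the resulting embeddings are meant to be compared with cosine similarity. A minimal sketch using the standard `sentence_transformers.util.cos_sim` helper (the sentences are illustrative, not from the card):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("h4c5/sts-camembert-base")

# Encode one query and two candidates, then compare them
embeddings = model.encode(
    [
        "Un homme joue de la guitare",
        "Une personne joue d'un instrument",
        "Il pleut à Paris",
    ]
)
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)  # the semantically closer candidate should get the higher score
```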

+## Usage with the `transformers` library

+```
+pip install -U transformers
+```

 ```python
 from transformers import AutoTokenizer, AutoModel
 import torch

+tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-camembert-base")
+model = AutoModel.from_pretrained("h4c5/sts-camembert-base")
+model.eval()

+sentences = ["Ceci est un exemple", "deuxième exemple"]

+# Mean pooling: average the token embeddings, weighted by the attention mask
 def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
+    input_mask_expanded = (
+        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    )
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
+        input_mask_expanded.sum(1), min=1e-9
+    )

+# Tokenize the sentences and compute the token embeddings
+encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
+with torch.no_grad():
+    model_output = model(**encoded_input)

+# Mean pooling
+sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

+print(sentence_embeddings)
+```
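
As a sanity check, the manually pooled vectors can be compared against the `sentence-transformers` output; a sketch assuming the snippet above was run in the same session (mean pooling is what the model's Pooling module applies, per the architecture below):

```python
import torch
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("h4c5/sts-camembert-base")
st_embeddings = torch.from_numpy(st_model.encode(sentences))

# Both routes should produce (near-)identical vectors
print(torch.allclose(sentence_embeddings, st_embeddings, atol=1e-4))
```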

+## Evaluation

+The model was evaluated on the [STSb fr](https://huggingface.co/datasets/stsb_multi_mt) dataset:

+```python
+from datasets import load_dataset
+from sentence_transformers import InputExample, SentenceTransformer, evaluation

+model = SentenceTransformer("h4c5/sts-camembert-base")


+def dataset_to_input_examples(dataset):
+    return [
+        InputExample(
+            texts=[example["sentence1"], example["sentence2"]],
+            label=example["similarity_score"] / 5.0,
+        )
+        for example in dataset
+    ]


+sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
+sts_test_examples = dataset_to_input_examples(sts_test_dataset)

+sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
+    sts_test_examples, name="sts-test"
+)

+sts_test_evaluator(model, ".")
+```
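
The comparison figures in the table below can presumably be obtained the same way; a sketch that reuses the evaluator on the public checkpoints listed there (the evaluator's return value is its main similarity score, and the per-metric values, including Pearson, should be written to a CSV in the output directory):

```python
from sentence_transformers import SentenceTransformer

for model_name in [
    "h4c5/sts-camembert-base",
    "Lajavaness/sentence-camembert-base",
    "inokufu/flaubert-base-uncased-xnli-sts",
    "sentence-transformers/distiluse-base-multilingual-cased-v2",
]:
    # Reuse the evaluator defined above on each checkpoint
    score = sts_test_evaluator(SentenceTransformer(model_name), ".")
    print(model_name, score)
```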

+### Results

+Below are the results of evaluating the model on the [`stsb_multi_mt`](https://huggingface.co/datasets/stsb_multi_mt)
+dataset (`fr` subset, `test` split):

+| Model | Pearson correlation | Parameters |
+| ----- | ------------------- | ---------- |
+| `h4c5/sts-camembert-base` | **0.837** | 110M |
+| [`Lajavaness/sentence-camembert-base`](https://huggingface.co/Lajavaness/sentence-camembert-base) | 0.835 | 110M |
+| [`inokufu/flaubert-base-uncased-xnli-sts`](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 0.828 | 137M |
+| [`sentence-transformers/distiluse-base-multilingual-cased-v2`](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 0.786 | 135M |
  ## Training
 
@@ -115,7 +151,7 @@ The model was trained with the parameters:

 `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`

+Parameters of the `fit()` method:
 ```
 {
     "epochs": 10,
 
@@ -135,6 +171,7 @@ Parameters of the fit()-Method:

 ## Full Model Architecture
+
 ```
 SentenceTransformer(
   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel
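
The module composition and the 768-dimensional output mentioned in the description can be checked programmatically; a small sketch using standard `sentence-transformers` accessors:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("h4c5/sts-camembert-base")
print(model)  # shows the Transformer and Pooling modules
print(model.get_sentence_embedding_dimension())  # 768
```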
 
@@ -144,16 +181,18 @@ SentenceTransformer(

 ## Citing

+@article{reimers2019sentence,
+    title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
+    author={Nils Reimers and Iryna Gurevych},
+    journal={arXiv preprint arXiv:1908.10084},
+    year={2019}
+}
+
+@inproceedings{martin2020camembert,
+    title={CamemBERT: a Tasty French Language Model},
+    author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
+    booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
+    url={https://arxiv.org/abs/1911.03894},
+    year={2020}
+}