---
language: fr
license: mit
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
    - sentence-transformers
    - feature-extraction
    - sentence-similarity
    - transformers
datasets:
    - stsb_multi_mt
metrics:
    - pearsonr
base_model: almanach/camembert-base
model-index:
  - name: sts-camembert-base
    results:
      - task:
          name: Sentence Similarity
          type: sentence-similarity
        dataset:
          name: STSb French
          type: stsb_multi_mt
          args: fr
        metrics:
          - name: Pearson Correlation - stsb_multi_mt fr
            type: pearsonr
            value: 0.837
---

## Description

This [sentence-transformers](https://www.SBERT.net) model was obtained by fine-tuning
[`almanach/camembert-base`](https://huggingface.co/almanach/camembert-base) with the
[sentence-transformers](https://www.SBERT.net) library.

It encodes a sentence or paragraph (up to 512 tokens) into a 768-dimensional vector.

The underlying [CamemBERT](https://arxiv.org/abs/1911.03894) model is a RoBERTa-style model that is
state of the art for French.

## Usage with the `sentence-transformers` library

```
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer
sentences = ["Ceci est un exemple", "deuxième exemple"]

model = SentenceTransformer('h4c5/sts-camembert-base')
embeddings = model.encode(sentences)
print(embeddings)
```
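
The embeddings can then be compared with cosine similarity, for instance with the `util` helpers bundled
with sentence-transformers. A minimal sketch reusing `model` and `embeddings` from above:

```python
from sentence_transformers import util

# Pairwise cosine similarity between the two example sentences
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)
```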


## Usage with the `transformers` library

```
pip install -U transformers
```

```python
from transformers import AutoTokenizer, AutoModel
import torch

sentences = ["Ceci est un exemple", "deuxième exemple"]

tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-camembert-base")
model = AutoModel.from_pretrained("h4c5/sts-camembert-base")
model.eval()


# Mean pooling: average the token embeddings, ignoring padding tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element contains all token embeddings
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


# Tokenize and compute token embeddings
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded_input)

# Mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

print(sentence_embeddings)
```
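
For cosine-similarity comparisons, the pooled embeddings can optionally be L2-normalized so that a plain dot
product yields the cosine score. This step is an addition to the snippet above, not part of the model itself:

```python
import torch.nn.functional as F

# L2-normalize, so that dot products equal cosine similarities
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized @ normalized.T)
```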


## Evaluation

The model was evaluated on the [STSb fr](https://huggingface.co/datasets/stsb_multi_mt) dataset:

```python
from datasets import load_dataset
from sentence_transformers import InputExample, SentenceTransformer, evaluation

model = SentenceTransformer("h4c5/sts-camembert-base")


def dataset_to_input_examples(dataset):
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]


sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
sts_test_examples = dataset_to_input_examples(sts_test_dataset)

sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)

sts_test_evaluator(model, ".")
```
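
The reported Pearson correlation can also be computed by hand from the cosine similarities of the embeddings.
A minimal sketch, assuming `scipy` is available and reusing `model` and `sts_test_dataset` from above:

```python
from scipy.stats import pearsonr
from sentence_transformers import util

emb1 = model.encode(sts_test_dataset["sentence1"])
emb2 = model.encode(sts_test_dataset["sentence2"])

# Cosine similarity of each (sentence1, sentence2) pair vs. the gold scores
cosine_scores = util.cos_sim(emb1, emb2).diagonal().numpy()
print(pearsonr(cosine_scores, sts_test_dataset["similarity_score"]))
```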

### Results

Below are the model's evaluation results on the [`stsb_multi_mt`](https://huggingface.co/datasets/stsb_multi_mt) dataset
(`fr` data, `test` split):

| Model                                                                                                                                             | Pearson Correlation | Parameters |
| :------------------------------------------------------------------------------------------------------------------------------------------------ | :-----------------: | ---------: |
| [`h4c5/sts-camembert-base`](https://huggingface.co/h4c5/sts-camembert-base)                                                                       |      **0.837**      |       110M |
| [`Lajavaness/sentence-camembert-base`](https://huggingface.co/Lajavaness/sentence-camembert-base)                                                 |        0.835        |       110M |
| [`inokufu/flaubert-base-uncased-xnli-sts`](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts)                                         |        0.828        |       137M |
| [`h4c5/sts-distilcamembert-base`](https://huggingface.co/h4c5/sts-distilcamembert-base)                                                           |        0.817        |        68M |
| [`sentence-transformers/distiluse-base-multilingual-cased-v2`](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) |        0.786        |       135M |



## Training
The model was trained with the following parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 180 with parameters:
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss` 

Parameters of the `fit()` method:
```
{
    "epochs": 10,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 500,
    "weight_decay": 0.01
}
```
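
Taken together, these settings correspond roughly to the training sketch below. It is illustrative only: it
reuses `dataset_to_input_examples` and `sts_test_evaluator` from the evaluation section, and loading a plain
CamemBERT checkpoint through `SentenceTransformer` defaults to mean pooling:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("almanach/camembert-base")  # defaults to mean pooling

# STSb French train split (5,749 pairs -> 180 batches of 32)
sts_train_dataset = load_dataset("stsb_multi_mt", name="fr", split="train")
train_dataloader = DataLoader(
    dataset_to_input_examples(sts_train_dataset), shuffle=True, batch_size=32
)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=sts_test_evaluator,
    evaluation_steps=1000,
    epochs=10,
    warmup_steps=500,
    optimizer_params={"lr": 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)
```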


## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
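
The same two-module stack (CamemBERT encoder followed by mean pooling) can be rebuilt explicitly with the
`models` API; a minimal sketch:

```python
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("almanach/camembert-base", max_seq_length=512)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```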

## Citing

    @inproceedings{reimers-2019-sentence-bert,
        title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
        author = "Reimers, Nils and Gurevych, Iryna",
        booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
        month = "11",
        year = "2019",
        publisher = "Association for Computational Linguistics",
        url = "https://arxiv.org/abs/1908.10084",
    }


    @inproceedings{martin2020camembert,
      title = {CamemBERT: a Tasty French Language Model},
      author = {Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
      booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
      url = {https://arxiv.org/abs/1911.03894},
      year = {2020}
    }