---
language: fr
license: mit
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
datasets:
- stsb_multi_mt
metrics:
- pearsonr
base_model: almanach/camembert-base
model-index:
- name: sts-camembert-base
  results:
  - task:
      name: Sentence Similarity
      type: sentence-similarity
    dataset:
      name: STSb French
      type: stsb_multi_mt
      args: fr
    metrics:
    - name: Pearson Correlation - stsb_multi_mt fr
      type: pearsonr
      value: 0.837
---
## Description
This [sentence-transformers](https://www.SBERT.net) model was obtained by fine-tuning
[`almanach/camembert-base`](https://huggingface.co/almanach/camembert-base) with the
[sentence-transformers](https://www.SBERT.net) library.
It encodes a sentence or a paragraph (up to 512 tokens, the `max_seq_length` listed under Full Model Architecture) into a 768-dimensional vector.
The underlying [CamemBERT](https://arxiv.org/abs/1911.03894) model is a RoBERTa-type model that its authors
reported as state of the art on several French NLP tasks.
## Usage with the `sentence-transformers` library
```
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer
sentences = ["Ceci est un exemple", "deuxième exemple"]
model = SentenceTransformer('h4c5/sts-camembert-base')
embeddings = model.encode(sentences)
print(embeddings)
```
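Since the model is fine-tuned for semantic textual similarity, its embeddings are typically compared with cosine similarity. The sketch below is a minimal pure-Python illustration and is not part of the original model card: the helper `cosine_similarity` is hypothetical, and the toy 3-dimensional vectors stand in for two rows of `embeddings`, which actually have 768 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence embeddings (assumption: real ones are 768-d)
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # same direction as emb_a
emb_c = [3.0, 0.0, -1.0]  # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(round(sim_ab, 6), round(sim_ac, 6))
```

With real embeddings, `cosine_similarity(embeddings[0], embeddings[1])` should work unchanged; for batched comparisons the library also provides `sentence_transformers.util.cos_sim`.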
## Usage with the `transformers` library
```
pip install -U transformers
```
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-camembert-base")
model = AutoModel.from_pretrained("h4c5/sts-camembert-base")
model.eval()

sentences = ["Ceci est un exemple", "deuxième exemple"]


# Mean pooling: average the token embeddings, ignoring padding positions
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


# Tokenize the sentences and compute the token embeddings
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**encoded_input)

# Mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
print(sentence_embeddings)
```
## Evaluation
The model was evaluated on the [STSb fr](https://huggingface.co/datasets/stsb_multi_mt) dataset:
```python
from datasets import load_dataset
from sentence_transformers import InputExample, SentenceTransformer, evaluation

model = SentenceTransformer("h4c5/sts-camembert-base")


def dataset_to_input_examples(dataset):
    # Gold similarity scores range from 0 to 5; rescale them to [0, 1]
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]


sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
sts_test_examples = dataset_to_input_examples(sts_test_dataset)
sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)
# Run the evaluation, using the current directory as the output path
sts_test_evaluator(model, ".")
```
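The reported metric is the Pearson correlation between the model's predicted similarity scores and the gold labels, which the evaluation code rescales from the 0 to 5 range down to 0 to 1. As a rough illustration of what is being measured, here is a pure-Python sketch. The `pearson` helper and the example scores are made up for demonstration; they are not taken from the dataset or from the evaluator's implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold scores on the 0-5 scale, rescaled to 0-1 as in the evaluation code
gold = [score / 5.0 for score in [5.0, 3.0, 1.0, 0.0]]
# Hypothetical cosine similarities predicted for the same four sentence pairs
predicted = [0.9, 0.7, 0.2, 0.1]

correlation = pearson(predicted, gold)
print(round(correlation, 3))
```

Pearson correlation is unchanged by a positive linear rescaling of either input, so dividing the labels by 5 does not affect the reported score; the rescaling matters mainly when labels are compared directly with cosine similarities, as in a cosine-similarity loss.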
### Results
The table below reports the evaluation results on the [`stsb_multi_mt`](https://huggingface.co/datasets/stsb_multi_mt) dataset
(`fr` subset, `test` split).

| Model                                                                                                                                          | Pearson Correlation | Parameters |
| :--------------------------------------------------------------------------------------------------------------------------------------------- | :-----------------: | ---------: |
| [`h4c5/sts-camembert-base`](https://huggingface.co/h4c5/sts-camembert-base) | **0.837** | 110M |
| [`Lajavaness/sentence-camembert-base`](https://huggingface.co/Lajavaness/sentence-camembert-base) | 0.835 | 110M |
| [`inokufu/flaubert-base-uncased-xnli-sts`](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 0.828 | 137M |
| [`h4c5/sts-distilcamembert-base`](https://huggingface.co/h4c5/sts-distilcamembert-base) | 0.817 | 68M |
| [`sentence-transformers/distiluse-base-multilingual-cased-v2`](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 0.786 | 135M |
## Training
The model was trained with the parameters:
**DataLoader**:
`torch.utils.data.dataloader.DataLoader` of length 180 with parameters:
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```
**Loss**:
`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`
Parameters of the `fit()` method:
```
{
"epochs": 10,
"evaluation_steps": 1000,
"evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 500,
"weight_decay": 0.01
}
```
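To make these settings concrete: a DataLoader of length 180 run for 10 epochs gives 1,800 optimizer steps, of which the first 500 are warmup. The sketch below reproduces the shape of a linear-warmup, linear-decay schedule in plain Python. It assumes that `"steps_per_epoch": null` means one full pass over the DataLoader per epoch and that `WarmupLinear` decays the learning rate to zero at the final step; the function name `warmup_linear_lr` is illustrative, not a library API.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=500, total_steps=1800):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup phase
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


steps_per_epoch = 180  # DataLoader length reported above
epochs = 10
total_steps = steps_per_epoch * epochs  # 1800 optimizer steps (assumed, see lead-in)

for step in (0, 250, 500, 1150, 1800):
    print(step, warmup_linear_lr(step, total_steps=total_steps))
```

Under these assumptions, warmup covers a little over a quarter of training (500 of 1,800 steps), and the peak learning rate of `2e-05` is reached only at step 500.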
## Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
## Citing
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

@inproceedings{martin2020camembert,
    title={CamemBERT: a Tasty French Language Model},
    author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
    booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
    journal={https://arxiv.org/abs/1911.03894},
    year={2020}
}