gguichard committed
Commit
fdfbda2
1 Parent(s): 2f0edd7

Update README.md

Files changed (1)
  1. README.md +2 -95
README.md CHANGED
@@ -137,15 +137,6 @@ text = "Pierre retrouva sa femme au restaurant. <Le couple> dina jusqu'à tard d


 ## Risks, Limitations and Biases
-**CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.**
-
-Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
-
-This model was pretrained on a subcorpus of the OSCAR multilingual corpus. Some of the limitations and risks associated with the OSCAR dataset, which are further detailed in the [OSCAR dataset card](https://huggingface.co/datasets/oscar), include the following:
-
-> The quality of some OSCAR sub-corpora might be lower than expected, specifically for the lowest-resource languages.
-
-> Constructed from Common Crawl, personal and sensitive information might be present.


@@ -153,103 +144,19 @@ This model was pretrained on a subcorpus of OSCAR multilingual corpus. Some of t


 #### Training Data
-OSCAR, or Open Super-large Crawled Aggregated coRpus, is a multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
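The French portion of the corpus can be inspected directly from the Hub; below is a minimal sketch using the `datasets` library, assuming the `oscar` dataset and its `unshuffled_deduplicated_fr` configuration (check the OSCAR dataset card for the exact configuration names):

```python
from datasets import load_dataset

# Stream the French OSCAR subset instead of downloading it in full.
# The configuration name is an assumption; see the OSCAR dataset card.
oscar_fr = load_dataset("oscar", "unshuffled_deduplicated_fr",
                        split="train", streaming=True)

# Peek at the first few documents.
for i, doc in enumerate(oscar_fr):
    print(doc["text"][:80])
    if i == 2:
        break
```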


-#### Training Procedure
+#### Training Procedure

-| Model                                     | #params | Arch. | Training data                     |
-|-------------------------------------------|---------|-------|-----------------------------------|
-| `camembert-base`                          | 110M    | Base  | OSCAR (138 GB of text)            |
-| `camembert/camembert-large`               | 335M    | Large | CCNet (135 GB of text)            |
-| `camembert/camembert-base-ccnet`          | 110M    | Base  | CCNet (135 GB of text)            |
-| `camembert/camembert-base-wikipedia-4gb`  | 110M    | Base  | Wikipedia (4 GB of text)          |
-| `camembert/camembert-base-oscar-4gb`      | 110M    | Base  | Subsample of OSCAR (4 GB of text) |
-| `camembert/camembert-base-ccnet-4gb`      | 110M    | Base  | Subsample of CCNet (4 GB of text) |
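Any identifier from the first column of the table can be passed to `from_pretrained`; for example, a sketch loading the CCNet variant:

```python
from transformers import CamembertModel, CamembertTokenizer

# Load one of the pretrained variants listed in the table above.
tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-ccnet")
model = CamembertModel.from_pretrained("camembert/camembert-base-ccnet")
```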


-## Evaluation
+## Evaluation

-The model developers evaluated CamemBERT on four downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER), and natural language inference (NLI).
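Once fine-tuned on one of these tasks, a checkpoint can be queried through a pipeline. A sketch for NER, where the checkpoint name is a hypothetical placeholder for any CamemBERT model fine-tuned on French NER data:

```python
from transformers import pipeline

# "my-org/camembert-base-ner" is a hypothetical checkpoint name; substitute
# any CamemBERT model fine-tuned for French named entity recognition.
ner = pipeline("token-classification",
               model="my-org/camembert-base-ner",
               aggregation_strategy="simple")

print(ner("Louis Martin travaille chez Inria à Paris."))
# Returns a list of dicts with entity_group, score, word, start and end keys.
```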



 ## Citation Information

-```bibtex
-@inproceedings{martin2020camembert,
-  title={CamemBERT: a Tasty French Language Model},
-  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
-  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
-  year={2020}
-}
-```

 ## How to Get Started With the Model

-##### Load CamemBERT and its sub-word tokenizer:
-```python
-from transformers import CamembertModel, CamembertTokenizer
-
-# You can replace "camembert-base" with any other model from the table above, e.g. "camembert/camembert-large".
-tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
-camembert = CamembertModel.from_pretrained("camembert-base")
-
-camembert.eval()  # disable dropout (or leave in train mode to fine-tune)
-```
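Equivalently, the `Auto` classes resolve the checkpoint type from its configuration, so the same loading can be written as:

```python
from transformers import AutoModel, AutoTokenizer

# AutoTokenizer / AutoModel resolve to the CamemBERT classes
# from the checkpoint's configuration file.
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
camembert = AutoModel.from_pretrained("camembert-base")
```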
-
-##### Filling masks using a pipeline
-```python
-from transformers import pipeline
-
-camembert_fill_mask = pipeline("fill-mask", model="camembert-base", tokenizer="camembert-base")
-results = camembert_fill_mask("Le camembert est <mask> :)")
-# results:
-# [{'sequence': '<s> Le camembert est délicieux :)</s>', 'score': 0.4909103214740753, 'token': 7200},
-#  {'sequence': '<s> Le camembert est excellent :)</s>', 'score': 0.10556930303573608, 'token': 2183},
-#  {'sequence': '<s> Le camembert est succulent :)</s>', 'score': 0.03453315049409866, 'token': 26202},
-#  {'sequence': '<s> Le camembert est meilleur :)</s>', 'score': 0.03303130343556404, 'token': 528},
-#  {'sequence': '<s> Le camembert est parfait :)</s>', 'score': 0.030076518654823303, 'token': 1654}]
-```
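The number of candidates returned can be adjusted with the pipeline's `top_k` argument (the default is 5):

```python
# Request ten candidate completions instead of the default five.
results = camembert_fill_mask("Le camembert est <mask> :)", top_k=10)
```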
-
-##### Extract contextual embedding features from CamemBERT's output
-```python
-import torch
-
-# Tokenize into sub-words with SentencePiece
-tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")
-# ['▁J', "'", 'aime', '▁le', '▁ca', 'member', 't', '▁!']
-
-# Convert the tokens to vocabulary indices and add the special start and end tokens
-encoded_sentence = tokenizer.encode(tokenized_sentence)
-# [5, 121, 11, 660, 16, 730, 25543, 110, 83, 6]
-# NB: this can be done in one step: tokenizer.encode("J'aime le camembert !")
-
-# Feed the tokens to CamemBERT as a torch tensor (batch of size 1)
-encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
-embeddings = camembert(encoded_sentence).last_hidden_state
-# embeddings.size() == torch.Size([1, 10, 768])
-# tensor([[[-0.0254,  0.0235,  0.1027,  ..., -0.1459, -0.0205, -0.0116],
-#          [ 0.0606, -0.1811, -0.0418,  ..., -0.1815,  0.0880, -0.0766],
-#          [-0.1561, -0.1127,  0.2687,  ..., -0.0648,  0.0249,  0.0446],
-#          ...,
-```
-
-##### Extract contextual embedding features from all of CamemBERT's layers
-```python
-from transformers import CamembertConfig
-
-# The model needs to be reloaded with a config that exposes all hidden states
-config = CamembertConfig.from_pretrained("camembert-base", output_hidden_states=True)
-camembert = CamembertModel.from_pretrained("camembert-base", config=config)
-
-all_layer_embeddings = camembert(encoded_sentence).hidden_states
-# len(all_layer_embeddings) == 13 (input embedding layer + 12 self-attention layers)
-all_layer_embeddings[5]
-# layer 5 contextual embeddings, size torch.Size([1, 10, 768])
-# tensor([[[-0.0032,  0.0075,  0.0040,  ..., -0.0025, -0.0178, -0.0210],
-#          [-0.0996, -0.1474,  0.1057,  ..., -0.0278,  0.1690, -0.2982],
-#          [ 0.0557, -0.0588,  0.0547,  ..., -0.0726, -0.0867,  0.0699],
-#          ...,
-```
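The tuple of per-layer tensors can also be stacked into a single tensor for convenience, as in this sketch:

```python
import torch

# Stack the 13 per-layer outputs into one tensor of shape
# (layers, batch, tokens, hidden_size) == (13, 1, 10, 768).
stacked_embeddings = torch.stack(all_layer_embeddings)
```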
 