bourdoiscatie committed on
Commit
7fa7b52
·
1 Parent(s): f0e5903

Update README.md

Files changed (1)
  1. README.md +412 -19
README.md CHANGED
@@ -1,42 +1,345 @@
1
  ---
2
  license: mit
3
  base_model: camembert-base
4
- tags:
5
- - generated_from_trainer
6
  metrics:
7
  - precision
8
  - recall
9
  - f1
10
  - accuracy
11
  model-index:
12
- - name: camembert-base-frenchNER_4entities
13
  results: []
 
 
 
 
 
 
 
 
 
14
  ---
15
 
16
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
17
- should probably proofread and complete it, then remove this comment. -->
18
 
19
- # camembert-base-frenchNER_4entities
20
 
21
- This model is a fine-tuned version of [camembert-base](https://huggingface.co/camembert-base) on an unknown dataset.
22
- It achieves the following results on the evaluation set:
23
- - Loss: 0.0542
24
- - Precision: 0.9844
25
- - Recall: 0.9844
26
- - F1: 0.9844
27
- - Accuracy: 0.9844
28
 
29
- ## Model description
 
 
 
30
 
31
- More information needed
32
 
33
- ## Intended uses & limitations
34
 
35
- More information needed
36
 
37
- ## Training and evaluation data
38
 
39
- More information needed
40
 
41
  ## Training procedure
42
 
@@ -66,3 +369,93 @@ The following hyperparameters were used during training:
66
  - Pytorch 2.1.2
67
  - Datasets 2.16.1
68
  - Tokenizers 0.15.0

1
  ---
2
  license: mit
3
  base_model: camembert-base
 
 
4
  metrics:
5
  - precision
6
  - recall
7
  - f1
8
  - accuracy
9
  model-index:
10
+ - name: Camembert-base-frenchNER_4entities
11
  results: []
12
+ datasets:
13
+ - CATIE-AQ/frenchNER_4entities
14
+ language:
15
+ - fr
16
+ widget:
17
+ - text: "Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au 14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le nécessaire pour avoir des certitudes. Avec six victoires en six matchs officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore Mike Maignan."
18
+ library_name: transformers
19
+ pipeline_tag: token-classification
20
+ co2_eq_emissions: 35
21
  ---
22
 
 
 
23
 
24
+ # Camembert-base-frenchNER_4entities
25
 
26
+ ## Model Description
 
 
 
 
 
 
27
 
28
+ We present **Camembert-base-frenchNER_4entities**, a [CamemBERT base](https://huggingface.co/camembert-base) model fine-tuned for French Named Entity Recognition on four French NER datasets, covering 4 entity types (LOC, PER, ORG, MISC).
29
+ All these datasets were concatenated and cleaned into a single dataset that we called [frenchNER_4entities](https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities).
30
+ There are a total of **384,773** rows, of which **328,757** are for training, **24,131** for validation and **31,885** for testing.
31
+ Our methodology is described in a blog post available in [English](https://blog.vaniila.ai/en/NER_en/) or [French](https://blog.vaniila.ai/NER/).
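As a quick sanity check on the split sizes quoted above, the following sketch recomputes the total and the share of each split. Only the three row counts come from the model card; the variable names are illustrative.

```python
# Row counts as stated in the model card.
splits = {"train": 328_757, "validation": 24_131, "test": 31_885}

total = sum(splits.values())

# Share of each split relative to the whole dataset.
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count:,} rows ({shares[name]:.1%})")
print(f"total: {total:,} rows")
```

The three splits sum to the stated total of 384,773 rows, with roughly 85% of the rows used for training.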
32
 
 
33
 
 
34
 
35
+ ## Dataset
36
 
37
+ The dataset used is [frenchNER_4entities](https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities), which contains ~385k sentences labeled with 4 entity categories plus an outside label:
38
+ * PER: person;
39
+ * LOC: location;
40
+ * ORG: organization;
41
+ * MISC: miscellaneous;
42
+ * O: background (outside any entity).
43
+
44
+ The distribution of the entities is as follows:
45
+
46
+ <table>
47
+ <thead>
48
+ <tr>
49
+ <th><br>Splits</th>
50
+ <th><br>O</th>
51
+ <th><br>PER</th>
52
+ <th><br>LOC</th>
53
+ <th><br>ORG</th>
54
+ <th><br>MISC</th>
55
+ </tr>
56
+ </thead>
57
+ <tbody>
+ <tr>
58
+ <td><br>train</td>
59
+ <td><br><b>A</b></td>
60
+ <td><br><b>B</b></td>
61
+ <td><br><b>C</b></td>
62
+ <td><br><b>D</b></td>
63
+ <td><br><b>E</b></td>
64
+ </tr>
65
+ <tr>
66
+ <td><br>validation</td>
67
+ <td><br><b>A</b></td>
68
+ <td><br><b>B</b></td>
69
+ <td><br><b>C</b></td>
70
+ <td><br><b>D</b></td>
71
+ <td><br><b>E</b></td>
72
+ </tr>
73
+ <tr>
74
+ <td><br>test</td>
75
+ <td><br><b>A</b></td>
76
+ <td><br><b>B</b></td>
77
+ <td><br><b>C</b></td>
78
+ <td><br><b>D</b></td>
79
+ <td><br><b>E</b></td>
80
+ </tr>
81
+ </tbody>
82
+ </table>
83
+
84
+
85
+ ## Evaluation results
86
+
87
+ The evaluation was carried out using the [**evaluate**](https://pypi.org/project/evaluate/) Python package.
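To make the rows of the tables below concrete, here is a minimal, dependency-free sketch of per-label precision, recall and F1 computed at token level. It is not the `evaluate` package and not necessarily the authors' exact protocol: the function name, the toy labels, and the reading of the "Number" row as the count of reference tokens per label are all assumptions made for illustration.

```python
def per_label_scores(references, predictions):
    """Token-level precision, recall, F1 and support for each label."""
    scores = {}
    for label in sorted(set(references) | set(predictions)):
        pairs = list(zip(references, predictions))
        tp = sum(r == label and p == label for r, p in pairs)
        fp = sum(r != label and p == label for r, p in pairs)
        fn = sum(r == label and p != label for r, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "number": tp + fn,  # assumed meaning of the "Number" row
        }
    return scores


# Toy labels, hand-written for illustration only.
refs = ["O", "PER", "PER", "O", "LOC", "O"]
preds = ["O", "PER", "O", "O", "LOC", "LOC"]

scores = per_label_scores(refs, preds)
for label, values in scores.items():
    print(label, values)
```

In this toy example the second `PER` token is missed, which lowers recall for `PER`, and one `O` token is tagged `LOC`, which lowers precision for `LOC`.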
88
+
89
+ ### frenchNER_4entities
90
+
91
+ <table>
92
+ <thead>
93
+ <tr>
94
+ <th><br>Model</th>
95
+ <th><br>Metrics</th>
96
+ <th><br>PER</th>
97
+ <th><br>LOC</th>
98
+ <th><br>ORG</th>
99
+ <th><br>MISC</th>
100
+ <th><br>O</th>
101
+ <th><br>Overall</th>
102
+ </tr>
103
+ </thead>
104
+ <tbody>
105
+ <tr>
106
+ <td rowspan="3"><br>Camembert-base-frenchNER_4entities</td>
107
+ <td><br>Precision</td>
108
+ <td><br>A</td>
109
+ <td><br>B</td>
110
+ <td><br>C</td>
111
+ <td><br>D</td>
112
+ <td><br>E</td>
113
+ <td><br>F</td>
114
+ </tr>
115
+ <tr>
116
+ <td><br>Recall</td>
117
+ <td><br>A</td>
118
+ <td><br>B</td>
119
+ <td><br>C</td>
120
+ <td><br>D</td>
121
+ <td><br>E</td>
122
+ <td><br>F</td>
123
+ </tr>
124
+ <tr>
125
+ <td><br>F1</td>
126
+ <td><br>A</td>
127
+ <td><br>B</td>
128
+ <td><br>C</td>
129
+ <td><br>D</td>
130
+ <td><br>E</td>
131
+ <td><br>F</td>
132
+ </tr>
133
+ <tr>
134
+ <td></td>
135
+ <td><br>Number</td>
136
+ <td><br>A</td>
137
+ <td><br>B</td>
138
+ <td><br>C</td>
139
+ <td><br>D</td>
140
+ <td><br>E</td>
141
+ <td><br>F</td>
142
+ </tr>
143
+ </tbody>
144
+ </table>
145
+
146
+
147
+ In detail:
148
+
149
+ ### multiconer
150
+
151
+ <table>
152
+ <thead>
153
+ <tr>
154
+ <th><br>Model</th>
155
+ <th><br>Metrics</th>
156
+ <th><br>PER</th>
157
+ <th><br>LOC</th>
158
+ <th><br>ORG</th>
159
+ <th><br>MISC</th>
160
+ <th><br>O</th>
161
+ <th><br>Overall</th>
162
+ </tr>
163
+ </thead>
164
+ <tbody>
165
+ <tr>
166
+ <td rowspan="3"><br>Camembert-base-frenchNER_4entities</td>
167
+ <td><br>Precision</td>
168
+ <td><br>A</td>
169
+ <td><br>B</td>
170
+ <td><br>C</td>
171
+ <td><br>D</td>
172
+ <td><br>E</td>
173
+ <td><br>F</td>
174
+ </tr>
175
+ <tr>
176
+ <td><br>Recall</td>
177
+ <td><br>A</td>
178
+ <td><br>B</td>
179
+ <td><br>C</td>
180
+ <td><br>D</td>
181
+ <td><br>E</td>
182
+ <td><br>F</td>
183
+ </tr>
184
+ <tr>
185
+ <td><br>F1</td>
186
+ <td><br>A</td>
187
+ <td><br>B</td>
188
+ <td><br>C</td>
189
+ <td><br>D</td>
190
+ <td><br>E</td>
191
+ <td><br>F</td>
192
+ </tr>
193
+ <tr>
194
+ <td></td>
195
+ <td><br>Number</td>
196
+ <td><br>A</td>
197
+ <td><br>B</td>
198
+ <td><br>C</td>
199
+ <td><br>D</td>
200
+ <td><br>E</td>
201
+ <td><br>F</td>
202
+ </tr>
203
+ </tbody>
204
+ </table>
205
+
206
+ ### multinerd
207
+
208
+ <table>
209
+ <thead>
210
+ <tr>
211
+ <th><br>Model</th>
212
+ <th><br>Metrics</th>
213
+ <th><br>PER</th>
214
+ <th><br>LOC</th>
215
+ <th><br>ORG</th>
216
+ <th><br>MISC</th>
217
+ <th><br>O</th>
218
+ <th><br>Overall</th>
219
+ </tr>
220
+ </thead>
221
+ <tbody>
222
+ <tr>
223
+ <td rowspan="3"><br>Camembert-base-frenchNER_4entities</td>
224
+ <td><br>Precision</td>
225
+ <td><br>A</td>
226
+ <td><br>B</td>
227
+ <td><br>C</td>
228
+ <td><br>D</td>
229
+ <td><br>E</td>
230
+ <td><br>F</td>
231
+ </tr>
232
+ <tr>
233
+ <td><br>Recall</td>
234
+ <td><br>A</td>
235
+ <td><br>B</td>
236
+ <td><br>C</td>
237
+ <td><br>D</td>
238
+ <td><br>E</td>
239
+ <td><br>F</td>
240
+ </tr>
241
+ <tr>
242
+ <td><br>F1</td>
243
+ <td><br>A</td>
244
+ <td><br>B</td>
245
+ <td><br>C</td>
246
+ <td><br>D</td>
247
+ <td><br>E</td>
248
+ <td><br>F</td>
249
+ </tr>
250
+ <tr>
251
+ <td></td>
252
+ <td><br>Number</td>
253
+ <td><br>A</td>
254
+ <td><br>B</td>
255
+ <td><br>C</td>
256
+ <td><br>D</td>
257
+ <td><br>E</td>
258
+ <td><br>F</td>
259
+ </tr>
260
+ </tbody>
261
+ </table>
262
+
263
+
264
+ ### wikiner
265
+
266
+ <table>
267
+ <thead>
268
+ <tr>
269
+ <th><br>Model</th>
270
+ <th><br>Metrics</th>
271
+ <th><br>PER</th>
272
+ <th><br>LOC</th>
273
+ <th><br>ORG</th>
274
+ <th><br>MISC</th>
275
+ <th><br>O</th>
276
+ <th><br>Overall</th>
277
+ </tr>
278
+ </thead>
279
+ <tbody>
280
+ <tr>
281
+ <td rowspan="3"><br>Camembert-base-frenchNER_4entities</td>
282
+ <td><br>Precision</td>
283
+ <td><br>A</td>
284
+ <td><br>B</td>
285
+ <td><br>C</td>
286
+ <td><br>D</td>
287
+ <td><br>E</td>
288
+ <td><br>F</td>
289
+ </tr>
290
+ <tr>
291
+ <td><br>Recall</td>
292
+ <td><br>A</td>
293
+ <td><br>B</td>
294
+ <td><br>C</td>
295
+ <td><br>D</td>
296
+ <td><br>E</td>
297
+ <td><br>F</td>
298
+ </tr>
299
+ <tr>
300
+ <td><br>F1</td>
301
+ <td><br>A</td>
302
+ <td><br>B</td>
303
+ <td><br>C</td>
304
+ <td><br>D</td>
305
+ <td><br>E</td>
306
+ <td><br>F</td>
307
+ </tr>
308
+ <tr>
309
+ <td></td>
310
+ <td><br>Number</td>
311
+ <td><br>A</td>
312
+ <td><br>B</td>
313
+ <td><br>C</td>
314
+ <td><br>D</td>
315
+ <td><br>E</td>
316
+ <td><br>F</td>
317
+ </tr>
318
+ </tbody>
319
+ </table>
320
+
321
+
322
+ ## Usage
323
+ ### Code
324
+
325
+ ```python
326
+ from transformers import pipeline
327
+
328
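+ # NER is a token-classification task; aggregation_strategy='simple' merges sub-word tokens into whole entities (it replaces the deprecated grouped_entities=True)
+ ner = pipeline('token-classification', model='CATIE-AQ/Camembert-base-frenchNER_4entities', tokenizer='CATIE-AQ/Camembert-base-frenchNER_4entities', aggregation_strategy='simple')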
329
+
330
+ result = ner(
331
+ "Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au 14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le nécessaire pour avoir des certitudes. Avec six victoires en six matchs officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore Mike Maignan."
332
+ )
333
+
334
+ print(result)
335
+ ```
336
+ ```python
337
+
338
+ ```
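When entities are grouped, the Transformers token-classification pipeline returns a list of dictionaries with keys such as `entity_group`, `score` and `word`. The sketch below shows one way to post-process such a list by entity type. The helper name, the score threshold and the sample entries are hypothetical; the sample is hand-written in the pipeline's output shape and is not real model output.

```python
from collections import defaultdict


def group_by_entity_type(entities, min_score=0.0):
    """Collect entity strings per label, keeping only confident predictions."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written sample, NOT real model output. Real results also carry
# "start" and "end" character offsets, omitted here.
sample = [
    {"entity_group": "LOC", "score": 0.99, "word": "Allemagne"},
    {"entity_group": "PER", "score": 0.98, "word": "Didier Deschamps"},
    {"entity_group": "MISC", "score": 0.40, "word": "Euro 2024"},
]

grouped = group_by_entity_type(sample, min_score=0.5)
print(grouped)
```

With a real pipeline, the `result` list from the usage snippet would be passed in place of `sample`.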
339
+
340
+ ### Try it through Space
341
+ A Space has been created to test the model. It is available [here](https://huggingface.co/spaces/CATIE-AQ/Camembert-NER).
342
 
 
343
 
344
  ## Training procedure
345
 
 
369
  - Pytorch 2.1.2
370
  - Datasets 2.16.1
371
  - Tokenizers 0.15.0
372
+
373
+
374
+ ## Environmental Impact
375
+
376
+ *Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.*
377
+
378
+ - **Hardware Type:** A100 PCIe 40/80GB
379
+ - **Hours used:** 1h45min
380
+ - **Cloud Provider:** Private Infrastructure
381
+ - **Carbon Efficiency (kg/kWh):** 0.046 (estimated from [electricitymaps](https://app.electricitymaps.com/zone/FR) for January 4, 2024)
382
+ - **Carbon Emitted** *(Power consumption x Time x Carbon produced based on location of power grid)*: 0.02 kg eq. CO2
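The formula in the last bullet can be worked through as follows. The model card does not state the power draw, so the 250 W figure below is an assumption (a nominal board power for an A100 PCIe card); the runtime and carbon efficiency are the values listed above.

```python
def carbon_emitted_kg(power_kw, hours, kg_co2_per_kwh):
    """Power consumption x time x carbon intensity of the power grid."""
    return power_kw * hours * kg_co2_per_kwh


ASSUMED_POWER_KW = 0.25      # assumption: 250 W, not stated in the model card
HOURS_USED = 1 + 45 / 60     # 1h45min, from the model card
CARBON_EFFICIENCY = 0.046    # kg CO2 eq. per kWh, from the model card

estimate = carbon_emitted_kg(ASSUMED_POWER_KW, HOURS_USED, CARBON_EFFICIENCY)
print(f"{estimate:.3f} kg eq. CO2")
```

Under that assumed power draw, the estimate rounds to the 0.02 kg eq. CO2 reported above.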
383
+
384
+
385
+
386
+ ## Citations
387
+
388
+ ### Camembert-base-frenchNER_4entities
389
+ ```
390
+ TODO
391
+ ```
392
+
393
+ ### multiconer
394
+
395
+ > @inproceedings{multiconer2-report,
396
+ title={{SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)}},
397
+ author={Fetahu, Besnik and Kar, Sudipta and Chen, Zhiyu and Rokhlenko, Oleg and Malmasi, Shervin},
398
+ booktitle={Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)},
399
+ year={2023},
400
+ publisher={Association for Computational Linguistics}}
401
+
402
+ > @article{multiconer2-data,
403
+ title={{MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition}},
404
+ author={Fetahu, Besnik and Chen, Zhiyu and Kar, Sudipta and Rokhlenko, Oleg and Malmasi, Shervin},
405
+ year={2023}}
406
+
407
+
408
+ ### multinerd
409
+
410
+ > @inproceedings{tedeschi-navigli-2022-multinerd,
411
+ title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",
412
+ author = "Tedeschi, Simone and Navigli, Roberto",
413
+ booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
414
+ month = jul,
415
+ year = "2022",
416
+ address = "Seattle, United States",
417
+ publisher = "Association for Computational Linguistics",
418
+ url = "https://aclanthology.org/2022.findings-naacl.60",
419
+ doi = "10.18653/v1/2022.findings-naacl.60",
420
+ pages = "801--812"}
421
+
422
+ ### pii-masking-200k
423
+
424
+ > @misc {ai4privacy_2023,
425
+ author = { {ai4Privacy} },
426
+ title = { pii-masking-200k (Revision 1d4c0a1) },
427
+ year = 2023,
428
+ url = { https://huggingface.co/datasets/ai4privacy/pii-masking-200k },
429
+ doi = { 10.57967/hf/1532 },
430
+ publisher = { Hugging Face }}
431
+
432
+ ### wikiner
433
+
434
+ > @article{NOTHMAN2013151,
435
+ title = {Learning multilingual named entity recognition from Wikipedia},
436
+ journal = {Artificial Intelligence},
437
+ volume = {194},
438
+ pages = {151-175},
439
+ year = {2013},
440
+ note = {Artificial Intelligence, Wikipedia and Semi-Structured Resources},
441
+ issn = {0004-3702},
442
+ doi = {https://doi.org/10.1016/j.artint.2012.03.006},
443
+ url = {https://www.sciencedirect.com/science/article/pii/S0004370212000276},
444
+ author = {Joel Nothman and Nicky Ringland and Will Radford and Tara Murphy and James R. Curran}}
445
+
446
+
447
+ ### frenchNER_4entities
448
+ ```
449
+ TODO
450
+ ```
451
+
452
+ ### CamemBERT
453
+ > @inproceedings{martin2020camembert,
454
+ title={CamemBERT: a Tasty French Language Model},
455
+ author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
456
+ booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
457
+ year={2020}}
458
+
459
+
460
+ ## License
461
+ [cc-by-4.0](https://creativecommons.org/licenses/by/4.0/deed.en)