Token Classification · Transformers · TensorBoard · Safetensors · French · camembert

bourdoiscatie committed · Commit 178e069 · 1 Parent(s): 04bbe41

Update README.md

Files changed (1):
  1. README.md +199 -20
README.md CHANGED
@@ -1,45 +1,125 @@
  ---
  license: mit
  base_model: camembert-base
- tags:
- - generated_from_trainer
  metrics:
  - precision
  - recall
  - f1
  - accuracy
  model-index:
- - name: camembert-base-frenchNER-3_epochs
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # camembert-base-frenchNER-3_epochs

- This model is a fine-tuned version of [camembert-base](https://huggingface.co/camembert-base) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.0876
- - Precision: 0.9292
- - Recall: 0.9534
- - F1: 0.9411
- - Accuracy: 0.9858

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

  ### Training hyperparameters

  The following hyperparameters were used during training:
@@ -66,3 +146,102 @@ The following hyperparameters were used during training:
  - Pytorch 2.1.1
  - Datasets 2.14.7
  - Tokenizers 0.15.0
  ---
  license: mit
  base_model: camembert-base
  metrics:
  - precision
  - recall
  - f1
  - accuracy
  model-index:
+ - name: Camembert-NER-base-frenchNER
  results: []
+ datasets:
+ - CATIE-AQ/frenchNER
+ language:
+ - fr
+ widget:
+ - text: "Boulanger, habitant à Boulanger et travaillant dans le magasin Boulanger situé dans la ville de Boulanger. Boulanger a écrit le livre éponyme Boulanger édité par la maison d'édition Boulanger."
+ library_name: transformers
+ pipeline_tag: token-classification
+ co2_eq_emissions: 35
  ---

+ # Camembert-NER-base-frenchNER

+ ## Model Description

+ We present **Camembert-NER-base-frenchNER**, a [CamemBERT base](https://huggingface.co/camembert-base) model fine-tuned for Named Entity Recognition in French on five French NER datasets covering three entity types (LOC, PER, ORG).
+ All these datasets were concatenated and cleaned into a single dataset that we called [frenchNER](https://huggingface.co/datasets/CATIE-AQ/frenchNER).
+ This represents a total of **420,264 rows, of which 346,071 are for training, 32,951 for validation and 41,242 for testing**.
+ Our methodology is described in a blog post available in [English](https://blog.vaniila.ai/en/NER_en/) or [French](https://blog.vaniila.ai/NER/).

+ ## Dataset

+ The dataset used is [frenchNER](https://huggingface.co/datasets/CATIE-AQ/frenchNER), which contains ~420k sentences labeled with 4 categories:
+ * PER: person;
+ * LOC: location;
+ * ORG: organization;
+ * O: background (outside any entity).

+ The distribution of the entities is as follows:

+ <table>
+ <thead>
+ <tr>
+ <th><br>Splits</th>
+ <th><br>O</th>
+ <th><br>PER</th>
+ <th><br>LOC</th>
+ <th><br>ORG</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><br>train</td>
+ <td><br><b>8,398,765</b></td>
+ <td><br><b>327,393</b></td>
+ <td><br><b>303,722</b></td>
+ <td><br><b>151,490</b></td>
+ </tr>
+ <tr>
+ <td><br>validation</td>
+ <td><br><b>592,815</b></td>
+ <td><br><b>34,127</b></td>
+ <td><br><b>30,279</b></td>
+ <td><br><b>18,743</b></td>
+ </tr>
+ <tr>
+ <td><br>test</td>
+ <td><br><b>773,871</b></td>
+ <td><br><b>43,634</b></td>
+ <td><br><b>39,195</b></td>
+ <td><br><b>21,391</b></td>
+ </tr>
+ </tbody>
+ </table>
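
Read as raw counts, the table hides how dominant the background class is; a quick back-of-the-envelope check on the train row (plain arithmetic on the numbers above, nothing model-specific) makes the imbalance explicit:

```python
# Share of the "O" (background) tag in the training split,
# computed from the counts in the table above.
counts = {"O": 8_398_765, "PER": 327_393, "LOC": 303_722, "ORG": 151_490}

total = sum(counts.values())
o_share = counts["O"] / total
print(f"{o_share:.1%}")  # ~91.5% of training tokens are background
```

This imbalance is one reason entity-level precision/recall/F1 are more informative here than raw token accuracy.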

+ ## Evaluation results

+ The evaluation was carried out using the [**evaluate**](https://pypi.org/project/evaluate/) Python package.

+ ### multiconer
+ TODO

+ ### multinerd
+ TODO

+ ### wikiann
+ TODO

+ ### wikiner
+ TODO

+ ### frenchNER
+ TODO

+ ## Usage
+ ### Code

+ ```python
+ from transformers import pipeline

+ ner = pipeline('token-classification', model='CATIE-AQ/Camembert-NER-base-frenchNER', tokenizer='CATIE-AQ/Camembert-NER-base-frenchNER', grouped_entities=True)

+ result = ner(
+ "Assurés de disputer l'Euro 2024 en Allemagne l'été prochain (du 14 juin au 14 juillet) depuis leur victoire aux Pays-Bas, les Bleus ont fait le nécessaire pour avoir des certitudes. Avec six victoires en six matchs officiels et un seul but encaissé, Didier Deschamps a consolidé les acquis de la dernière Coupe du monde. Les joueurs clés sont connus : Kylian Mbappé, Aurélien Tchouameni, Antoine Griezmann, Ibrahima Konaté ou encore Mike Maignan."
+ )

+ print(result)
+ ```
+ ```python
+ TODO
+ ```

+ ### Try it through Space
+ A Space has been created to test the model. It is available [here](https://huggingface.co/spaces/CATIE-AQ/Camembert-NER).


+ ## Training procedure
  ### Training hyperparameters

  The following hyperparameters were used during training:

  - Pytorch 2.1.1
  - Datasets 2.14.7
  - Tokenizers 0.15.0
+
+ ## Environmental Impact
+
+ *Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.*
+
+ - **Hardware Type:** A100 PCIe 40/80GB
+ - **Hours used:** 1h45min
+ - **Cloud Provider:** Private Infrastructure
+ - **Carbon Efficiency (kg/kWh):** 0.079 (estimated from [electricitymaps](https://app.electricitymaps.com/zone/FR) for December 15, 2023)
+ - **Carbon Emitted** *(Power consumption x Time x Carbon produced based on location of power grid)*: 0.035 kg eq. CO2
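The formula in the last bullet can be checked numerically. Note the 250 W power draw below is our assumption (roughly the TDP of an A100 PCIe 40GB card), not a measured value:

```python
# Carbon Emitted = power consumption (kW) x time (h) x carbon efficiency (kg/kWh)
power_kw = 0.250     # assumed ~250 W average draw (A100 PCIe 40GB TDP); not measured
hours = 1.75         # 1h45min of training
kg_per_kwh = 0.079   # French grid carbon efficiency on 2023-12-15

emissions_kg = power_kw * hours * kg_per_kwh
print(round(emissions_kg, 3))  # 0.035 kg eq. CO2, matching the card
```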
+
+ ## Citations
+
+ ### Camembert-NER-frenchNER
+ ```
+ TODO
+ ```
+
+ ### multiconer
+
+ ```
+ @inproceedings{multiconer2-report,
+ title={{SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)}},
+ author={Fetahu, Besnik and Kar, Sudipta and Chen, Zhiyu and Rokhlenko, Oleg and Malmasi, Shervin},
+ booktitle={Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)},
+ year={2023},
+ publisher={Association for Computational Linguistics}}
+ ```
+
+ ```
+ @article{multiconer2-data,
+ title={{MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition}},
+ author={Fetahu, Besnik and Chen, Zhiyu and Kar, Sudipta and Rokhlenko, Oleg and Malmasi, Shervin},
+ year={2023}}
+ ```
+ ### multinerd
+
+ ```
+ @inproceedings{tedeschi-navigli-2022-multinerd,
+ title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",
+ author = "Tedeschi, Simone and Navigli, Roberto",
+ booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
+ month = jul,
+ year = "2022",
+ address = "Seattle, United States",
+ publisher = "Association for Computational Linguistics",
+ url = "https://aclanthology.org/2022.findings-naacl.60",
+ doi = "10.18653/v1/2022.findings-naacl.60",
+ pages = "801--812"}
+ ```
+
+ ### pii-masking-200k
+ ```
+ TODO
+ ```
+
+ ### wikiann
+
+ ```
+ @inproceedings{rahimi-etal-2019-massively,
+ title = "Massively Multilingual Transfer for {NER}",
+ author = "Rahimi, Afshin and Li, Yuan and Cohn, Trevor",
+ booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
+ month = jul,
+ year = "2019",
+ address = "Florence, Italy",
+ publisher = "Association for Computational Linguistics",
+ url = "https://www.aclweb.org/anthology/P19-1015",
+ pages = "151--164"}
+ ```
+
+ ### wikiner
+
+ ```
+ @article{NOTHMAN2013151,
+ title = {Learning multilingual named entity recognition from Wikipedia},
+ journal = {Artificial Intelligence},
+ volume = {194},
+ pages = {151-175},
+ year = {2013},
+ note = {Artificial Intelligence, Wikipedia and Semi-Structured Resources},
+ issn = {0004-3702},
+ doi = {https://doi.org/10.1016/j.artint.2012.03.006},
+ url = {https://www.sciencedirect.com/science/article/pii/S0004370212000276},
+ author = {Joel Nothman and Nicky Ringland and Will Radford and Tara Murphy and James R. Curran}}
+ ```
+
+ ### frenchNER
+ ```
+ TODO
+ ```
+
+ ### CamemBERT
+ ```
+ @inproceedings{martin2020camembert,
+ title={CamemBERT: a Tasty French Language Model},
+ author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
+ booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
+ year={2020}}
+ ```
+
+ ## License
+ [cc-by-4.0](https://creativecommons.org/licenses/by/4.0/deed.en)