30
- # camembertv2-base-QA
 
 
 
 
 
 
 
 
 
 
 
 
 
31
 
32
- This model is a fine-tuned version of [almanach/camembertv2-base](https://huggingface.co/almanach/camembertv2-base) on the french_qa dataset.
33
- It achieves the following results on the evaluation set:
34
- - Loss: 1.5647
35
 
36
- ## Model description
37
 
38
- More information needed
 
 
 
 
 
 
 
 
 
 
 
39
 
40
- ## Intended uses & limitations
 
 
 
 
 
 
 
 
 
 
 
41
 
42
- More information needed
 
 
 
 
 
 
 
 
43
 
44
- ## Training and evaluation data
 
 
 
 
 
 
 
 
 
 
 
45
 
46
- More information needed
47
 
48
- ## Training procedure
49
 
50
- ### Training hyperparameters
 
 
 
 
 
 
 
 
 
 
 
51
 
52
- The following hyperparameters were used during training:
53
- - learning_rate: 3e-05
54
- - train_batch_size: 8
55
- - eval_batch_size: 8
56
- - seed: 42
57
- - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
58
- - lr_scheduler_type: linear
59
- - num_epochs: 3
 
 
 
 
 
 
 
 
 
 
60
 
61
- ### Training results
 
 
 
 
 
 
 
 
 
62
 
63
- | Training Loss | Epoch | Step | Validation Loss |
64
- |:-------------:|:-----:|:-----:|:---------------:|
65
- | 1.4453 | 1.0 | 27790 | 1.4057 |
66
- | 1.3028 | 2.0 | 55580 | 1.5300 |
67
- | 1.1695 | 3.0 | 83370 | 1.5647 |
 
 
 
 
68
 
69
 
70
- ### Framework versions
71
 
72
- - Transformers 4.46.1
73
- - Pytorch 2.4.0+cu121
74
- - Datasets 2.21.0
75
- - Tokenizers 0.20.1
 
---
language: fr
datasets:
- etalab-ia/piaf
- fquad
- lincoln/newsquadfr
- pragnakalp/squad_v2_french_translated
- CATIE-AQ/frenchQA
library_name: transformers
license: mit
base_model: almanach/camembertv2-base
metrics:
- f1
- exact_match
widget:
- text: Combien de personnes utilisent le français tous les jours ?
  context: >-
    Le français est une langue indo-européenne de la famille des langues romanes
    dont les locuteurs sont appelés francophones. Elle est parfois surnommée la
    langue de Molière. Le français est parlé, en 2023, sur tous les continents
    par environ 321 millions de personnes : 235 millions l'emploient
    quotidiennement et 90 millions en sont des locuteurs natifs. En 2018, 80
    millions d'élèves et étudiants s'instruisent en français dans le monde.
    Selon l'Organisation internationale de la francophonie (OIF), il pourrait y
    avoir 700 millions de francophones sur Terre en 2050.
co2_eq_emissions: 26
---

# QAmemBERT2

## Model Description

We present **QAmemBERT2**, a [CamemBERT v2 base](https://huggingface.co/almanach/camembertv2-base) model fine-tuned on [frenchQA](https://huggingface.co/datasets/CATIE-AQ/frenchQA), a French question-answering dataset in SQuAD 2.0 format comprising **221,348 context/question/answer triplets for training and 6,376 for testing**.
Our methodology is described in a blog post, available in [English](https://blog.vaniila.ai/en/QA_en/) or [French](https://blog.vaniila.ai/QA/).
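
For a quick look at the training data, the dataset can be loaded with the `datasets` library. A minimal sketch (the split names are an assumption; see the dataset card for the authoritative schema):

```python
from datasets import load_dataset

# Load the frenchQA dataset this model was fine-tuned on.
dataset = load_dataset("CATIE-AQ/frenchQA")

# Expected sizes: ~221,348 training triplets and 6,376 test triplets.
print(dataset)
print(dataset["train"][0])  # one context/question/answer triplet (split name assumed)
```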

## Results (frenchQA test split)

| Model | Exact_match | F1-score | Answer_f1 | NoAnswer_f1 |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| [QAmembert](https://huggingface.co/CATIE-AQ/QAmembert) (110M) | 77.14 | 86.88 | 75.66 | 98.11 |
| QAmembert2 (this version) (112M) | 76.47 | 88.25 | 78.66 | 97.84 |
| [QAmembert-large](https://huggingface.co/CATIE-AQ/QAmembert-large) (336M) | 77.14 | 88.74 | 78.83 | **98.65** |
| [QAmemberta](https://huggingface.co/CATIE-AQ/QAmemberta) (111M) | **78.18** | **89.53** | **81.40** | 97.64 |
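
`Answer_f1` and `NoAnswer_f1` follow the SQuAD 2.0 evaluation convention: F1 restricted to answerable and unanswerable questions, respectively. A minimal sketch of how such scores can be computed with the `evaluate` library (a toy two-example input in the standard `squad_v2` metric format, not code from this repository):

```python
import evaluate

squad_v2 = evaluate.load("squad_v2")

# One answerable and one unanswerable toy example.
predictions = [
    {"id": "0", "prediction_text": "235 millions", "no_answer_probability": 0.0},
    {"id": "1", "prediction_text": "", "no_answer_probability": 1.0},
]
references = [
    {"id": "0", "answers": {"text": ["235 millions"], "answer_start": [261]}},
    {"id": "1", "answers": {"text": [], "answer_start": []}},
]

results = squad_v2.compute(predictions=predictions, references=references)
# The output includes overall exact/f1 plus the HasAns/NoAns splits.
print(results["f1"], results["HasAns_f1"], results["NoAns_f1"])
```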

### Usage

```python
from transformers import pipeline

qa = pipeline('question-answering', model='CATIE-AQ/QAmembert2', tokenizer='CATIE-AQ/QAmembert2')

# Question: "How many people use French every day?"
result = qa({
    'question': "Combien de personnes utilisent le français tous les jours ?",
    'context': "Le français est une langue indo-européenne de la famille des langues romanes dont les locuteurs sont appelés francophones. Elle est parfois surnommée la langue de Molière. Le français est parlé, en 2023, sur tous les continents par environ 321 millions de personnes : 235 millions l'emploient quotidiennement et 90 millions en sont des locuteurs natifs. En 2018, 80 millions d'élèves et étudiants s'instruisent en français dans le monde. Selon l'Organisation internationale de la francophonie (OIF), il pourrait y avoir 700 millions de francophones sur Terre en 2050."
})

# A very low score suggests the answer is not in the provided context.
if result['score'] < 0.01:
    print("La réponse n'est pas dans le contexte fourni.")  # "The answer is not in the provided context."
else:
    print(result['answer'])
```
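
Because the training data follows the SQuAD 2.0 format and contains unanswerable questions, a low pipeline score doubles as a no-answer signal; the `0.01` cutoff above is a heuristic, not a calibrated threshold. For more control, the model can also be queried without the pipeline. A minimal sketch using the standard `transformers` question-answering API with simple argmax span decoding (no n-best or no-answer handling):

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("CATIE-AQ/QAmembert2")
model = AutoModelForQuestionAnswering.from_pretrained("CATIE-AQ/QAmembert2")

question = "Combien de personnes utilisent le français tous les jours ?"
context = (
    "Le français est parlé, en 2023, sur tous les continents par environ "
    "321 millions de personnes : 235 millions l'emploient quotidiennement "
    "et 90 millions en sont des locuteurs natifs."
)

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the span between the most likely start and end token positions.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))  # expected: 235 millions
```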

### Try it in a Space

A Space has been created to test the model. It is available [here](https://huggingface.co/spaces/CATIE-AQ/Qamembert).
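
## Training procedure

The following hyperparameters were used during fine-tuning:
- learning_rate: 3e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 1.4453 | 1.0 | 27790 | 1.4057 |
| 1.3028 | 2.0 | 55580 | 1.5300 |
| 1.1695 | 3.0 | 83370 | 1.5647 |

### Framework versions

- Transformers 4.46.1
- PyTorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.20.1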

## Environmental Impact

*Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.*

- **Hardware Type:** A100 PCIe 40/80GB
- **Hours used:** 4 h 47 min
- **Cloud Provider:** Private Infrastructure
- **Carbon Efficiency (kg CO2eq/kWh):** 0.055 (estimated from [electricitymaps](https://app.electricitymaps.com/zone/FR), using the carbon intensity in France on November 21, 2024)
- **Carbon Emitted** *(power consumption × time × carbon intensity of the power grid)*: 26 g CO2eq
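
As a back-of-the-envelope check of the figure above (the ~100 W average draw below is back-solved from the reported emissions, not a measured value for this run):

```python
# Sanity-check the reported 26 g CO2eq: emissions = power x time x intensity.
hours = 4 + 47 / 60            # 4 h 47 min of fine-tuning
intensity_kg_per_kwh = 0.055   # France, 2024-11-21 (electricitymaps)
avg_power_kw = 0.1             # assumed average draw, inferred from the result

emissions_g = avg_power_kw * hours * intensity_kg_per_kwh * 1000
print(f"{emissions_g:.0f} g CO2eq")  # ~26 g
```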

## Citations

### QAmemBERT2 & QAmemBERTa
```
@misc{qamemberta2024,
  author       = { {BOURDOIS, Loïck} },
  organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
  title        = { QAmemberta },
  year         = 2024,
  url          = { https://huggingface.co/CATIE-AQ/QAmemberta },
  doi          = { 10.57967/hf/0821 },
  publisher    = { Hugging Face }
}
```

### QAmemBERT
```
@misc{qamembert2023,
  author       = { {ALBAR, Boris and BEDU, Pierre and BOURDOIS, Loïck} },
  organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
  title        = { QAmembert (Revision 9685bc3) },
  year         = 2023,
  url          = { https://huggingface.co/CATIE-AQ/QAmembert },
  doi          = { 10.57967/hf/0821 },
  publisher    = { Hugging Face }
}
```

### CamemBERT
```
@inproceedings{martin2020camembert,
  title     = {CamemBERT: a Tasty French Language Model},
  author    = {Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year      = {2020}
}
```

### CamemBERT 2.0
```
@misc{antoun2024camembert20smarterfrench,
  title         = {CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
  author        = {Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
  year          = {2024},
  eprint        = {2411.08868},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2411.08868},
}
```

### frenchQA
```
@misc{frenchQA2023,
  author       = { {ALBAR, Boris and BEDU, Pierre and BOURDOIS, Loïck} },
  organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
  title        = { frenchQA (Revision 6249cd5) },
  year         = 2023,
  url          = { https://huggingface.co/CATIE-AQ/frenchQA },
  doi          = { 10.57967/hf/0862 },
  publisher    = { Hugging Face }
}
```

### PIAF
```
@inproceedings{KeraronLBAMSSS20,
  author    = {Rachel Keraron and Guillaume Lancrenon and Mathilde Bras and Fr{\'{e}}d{\'{e}}ric Allary and Gilles Moyse and Thomas Scialom and Edmundo{-}Pavel Soriano{-}Morales and Jacopo Staiano},
  title     = {Project {PIAF:} Building a Native French Question-Answering Dataset},
  booktitle = {{LREC}},
  pages     = {5481--5490},
  publisher = {European Language Resources Association},
  year      = {2020}
}
```

### FQuAD
```
@article{dHoffschmidt2020FQuADFQ,
  title   = {FQuAD: French Question Answering Dataset},
  author  = {Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl{\'e} and Quentin Heinrich},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2002.06071}
}
```

### lincoln/newsquadfr
```
Hugging Face repository: https://huggingface.co/datasets/lincoln/newsquadfr
```

### pragnakalp/squad_v2_french_translated
```
Hugging Face repository: https://huggingface.co/datasets/pragnakalp/squad_v2_french_translated
```

## License

MIT