bourdoiscatie committed
Commit 19783b9 • 1 Parent(s): 6d90471
Update README.md
README.md CHANGED
@@ -1,75 +1,190 @@
---
library_name: transformers
license: mit
base_model: almanach/camembertv2-base
---

```
{'exact': 76.47427854454203,
 'f1': 88.24867416649663,
 'total': 6376,
 'HasAns_exact': 55.112923462986195,
 'HasAns_f1': 78.66171470689538,
 'HasAns_total': 3188,
 'NoAns_exact': 97.83563362609787,
 'NoAns_f1': 97.83563362609787,
 'NoAns_total': 3188,
 'best_exact': 76.47427854454203,
 'best_exact_thresh': 0.0,
 'best_f1': 88.24867416649728,
 'best_f1_thresh': 0.0}
```

This model is a fine-tuned version of [almanach/camembertv2-base](https://huggingface.co/almanach/camembertv2-base) on the french_qa dataset.
It achieves the following results on the evaluation set:
- Loss: 1.5647

More information needed

## Training procedure

### Framework versions

- Datasets 2.21.0
- Tokenizers 0.20.1

---
language: fr
datasets:
- etalab-ia/piaf
- fquad
- lincoln/newsquadfr
- pragnakalp/squad_v2_french_translated
- CATIE-AQ/frenchQA
library_name: transformers
license: mit
base_model: almanach/camembertv2-base
metrics:
- f1
- exact_match
widget:
- text: Combien de personnes utilisent le français tous les jours ?
  context: >-
    Le français est une langue indo-européenne de la famille des langues romanes
    dont les locuteurs sont appelés francophones. Elle est parfois surnommée la
    langue de Molière. Le français est parlé, en 2023, sur tous les continents
    par environ 321 millions de personnes : 235 millions l'emploient
    quotidiennement et 90 millions en sont des locuteurs natifs. En 2018, 80
    millions d'élèves et étudiants s'instruisent en français dans le monde.
    Selon l'Organisation internationale de la francophonie (OIF), il pourrait y
    avoir 700 millions de francophones sur Terre en 2050.
co2_eq_emissions: 26
---

# QAmemBERT2

## Model Description

We present **QAmemBERT2**, a [CamemBERT v2 base](https://huggingface.co/almanach/camembertv2-base) model fine-tuned on [frenchQA](https://huggingface.co/datasets/CATIE-AQ/frenchQA), a French question-answering dataset in SQuAD 2.0 format, comprising **221,348 context/question/answer triplets for training and 6,376 for testing**.
Our methodology is described in a blog post available in [English](https://blog.vaniila.ai/en/QA_en/) or [French](https://blog.vaniila.ai/QA/).
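
The frenchQA dataset follows the SQuAD 2.0 schema (unanswerable questions included). As a quick way to inspect it, here is a minimal sketch using the `datasets` library; the `train`/`test` split names are an assumption, not taken from this card:

```python
from datasets import load_dataset

# Minimal sketch: split names ("train"/"test") are assumed, not documented here.
french_qa = load_dataset("CATIE-AQ/frenchQA")

print(french_qa)              # expected: ~221,348 training rows and 6,376 test rows
print(french_qa["train"][0])  # SQuAD 2.0-style fields: context, question, answers
```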

## Results (frenchQA test split)

| Model | Exact match | F1 score | Answer F1 | No-answer F1 |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| [QAmembert](https://huggingface.co/CATIE-AQ/QAmembert) (110M) | 77.14 | 86.88 | 75.66 | 98.11 |
| QAmembert2 (this version) (112M) | 76.47 | 88.25 | 78.66 | 97.84 |
| [QAmembert-large](https://huggingface.co/CATIE-AQ/QAmembert-large) (336M) | 77.14 | 88.74 | 78.83 | **98.65** |
| [QAmemberta](https://huggingface.co/CATIE-AQ/QAmemberta) (111M) | **78.18** | **89.53** | **81.40** | 97.64 |
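
The exact-match and F1 figures follow the standard SQuAD 2.0 evaluation; the raw scores for QAmembert2, including the `HasAns_*`/`NoAns_*` breakdown behind the Answer F1 and No-answer F1 columns, appeared in the previous version of this card. A minimal sketch, assuming the Hugging Face `evaluate` library's `squad_v2` metric and purely hypothetical prediction/reference records:

```python
import evaluate

squad_v2 = evaluate.load("squad_v2")

# Hypothetical records, for illustration only.
predictions = [
    {"id": "q1", "prediction_text": "235 millions", "no_answer_probability": 0.0},
]
references = [
    {"id": "q1", "answers": {"text": ["235 millions"], "answer_start": [250]}},
]

# Returns 'exact', 'f1' and the HasAns_*/NoAns_* breakdowns shown in the table above.
print(squad_v2.compute(predictions=predictions, references=references))
```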

### Usage

```python
from transformers import pipeline

qa = pipeline('question-answering', model='CATIE-AQ/QAmembert2', tokenizer='CATIE-AQ/QAmembert2')

result = qa({
    'question': "Combien de personnes utilisent le français tous les jours ?",
    'context': "Le français est une langue indo-européenne de la famille des langues romanes dont les locuteurs sont appelés francophones. Elle est parfois surnommée la langue de Molière. Le français est parlé, en 2023, sur tous les continents par environ 321 millions de personnes : 235 millions l'emploient quotidiennement et 90 millions en sont des locuteurs natifs. En 2018, 80 millions d'élèves et étudiants s'instruisent en français dans le monde. Selon l'Organisation internationale de la francophonie (OIF), il pourrait y avoir 700 millions de francophones sur Terre en 2050."
})

# A very low score suggests the answer is not present in the context
# (the model is trained on SQuAD 2.0-style unanswerable questions).
if result['score'] < 0.01:
    print("La réponse n'est pas dans le contexte fourni.")
else:
    print(result['answer'])
```
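
The `result['score'] < 0.01` test above is a simple heuristic for spotting unanswerable questions. The `transformers` question-answering pipeline also accepts a `handle_impossible_answer` argument, in which case the pipeline itself can return an empty answer; this is not the approach used in this card, only an alternative sketch:

```python
from transformers import pipeline

qa = pipeline('question-answering', model='CATIE-AQ/QAmembert2', tokenizer='CATIE-AQ/QAmembert2')

# With handle_impossible_answer=True, an empty 'answer' string means the model
# found no answer span in the context.
result = qa(
    question="Combien de personnes utilisent le français tous les jours ?",
    context="Le français est parlé, en 2023, sur tous les continents par environ "
            "321 millions de personnes : 235 millions l'emploient quotidiennement.",
    handle_impossible_answer=True,
)
print(result['answer'] or "La réponse n'est pas dans le contexte fourni.")
```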

### Try it through Space

A Space has been created to test the model. It is available [here](https://huggingface.co/spaces/CATIE-AQ/Qamembert).

## Environmental Impact

*Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.*

- **Hardware Type:** A100 PCIe 40/80GB
- **Hours used:** 4 h 47 min
- **Cloud Provider:** Private Infrastructure
- **Carbon Efficiency (kg/kWh):** 0.055 kg (estimated from [electricitymaps](https://app.electricitymaps.com/zone/FR); carbon intensity in France on November 21, 2024)
- **Carbon Emitted** *(power consumption x time x carbon intensity of the power grid)*: 26 g eq. CO2 (see the sketch below)
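
To make the formula above concrete, here is a minimal sketch of the calculation; the average power draw is an assumption (roughly 100 W averaged over the run, chosen only so the example lands in the reported order of magnitude) and is not a figure from this card:

```python
# Carbon emitted ≈ average power (kW) x time (h) x grid carbon intensity (kg CO2 eq./kWh)
avg_power_kw = 0.10          # assumed average draw, NOT a measurement from this card
hours = 4 + 47 / 60          # 4 h 47 min
carbon_intensity = 0.055     # kg CO2 eq./kWh (France, November 21, 2024)

emissions_g = avg_power_kw * hours * carbon_intensity * 1000
print(f"~{emissions_g:.0f} g CO2 eq.")  # about 26 g under these assumptions
```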

## Citations

### QAmemBERT2 & QAmemBERTa
```
@misc {qamembert2023,
    author = { {BOURDOIS, Loïck} },
    organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
    title = { QAmemberta },
    year = 2024,
    url = { https://huggingface.co/CATIE-AQ/QAmemberta },
    doi = { 10.57967/hf/0821 },
    publisher = { Hugging Face }
}
```

### QAmemBERT
```
@misc {qamembert2023,
    author = { {ALBAR, Boris and BEDU, Pierre and BOURDOIS, Loïck} },
    organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
    title = { QAmembert (Revision 9685bc3) },
    year = 2023,
    url = { https://huggingface.co/CATIE-AQ/QAmembert },
    doi = { 10.57967/hf/0821 },
    publisher = { Hugging Face }
}
```

### CamemBERT
```
@inproceedings{martin2020camembert,
    title = {CamemBERT: a Tasty French Language Model},
    author = {Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
    booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
    year = {2020}
}
```

### CamemBERT 2.0
```
@misc{antoun2024camembert20smarterfrench,
    title = {CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
    author = {Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
    year = {2024},
    eprint = {2411.08868},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL},
    url = {https://arxiv.org/abs/2411.08868}
}
```

### frenchQA
```
@misc {frenchQA2023,
    author = { {ALBAR, Boris and BEDU, Pierre and BOURDOIS, Loïck} },
    organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
    title = { frenchQA (Revision 6249cd5) },
    year = 2023,
    url = { https://huggingface.co/CATIE-AQ/frenchQA },
    doi = { 10.57967/hf/0862 },
    publisher = { Hugging Face }
}
```

### PIAF
```
@inproceedings{KeraronLBAMSSS20,
    author = {Rachel Keraron and
              Guillaume Lancrenon and
              Mathilde Bras and
              Fr{\'{e}}d{\'{e}}ric Allary and
              Gilles Moyse and
              Thomas Scialom and
              Edmundo{-}Pavel Soriano{-}Morales and
              Jacopo Staiano},
    title = {Project {PIAF:} Building a Native French Question-Answering Dataset},
    booktitle = {{LREC}},
    pages = {5481--5490},
    publisher = {European Language Resources Association},
    year = {2020}
}
```

### FQuAD
```
@article{dHoffschmidt2020FQuADFQ,
    title = {FQuAD: French Question Answering Dataset},
    author = {Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl{\'e} and Quentin Heinrich},
    journal = {ArXiv},
    year = {2020},
    volume = {abs/2002.06071}
}
```

### lincoln/newsquadfr
```
Hugging Face repository: https://huggingface.co/datasets/lincoln/newsquadfr
```

### pragnakalp/squad_v2_french_translated
```
Hugging Face repository: https://huggingface.co/datasets/pragnakalp/squad_v2_french_translated
```

## License
MIT