---
language: fr
datasets:
- etalab-ia/piaf
- fquad
- lincoln/newsquadfr
- pragnakalp/squad_v2_french_translated
- CATIE-AQ/frenchQA
library_name: transformers
license: mit
base_model: almanach/camembertv2-base
metrics:
- f1
- exact_match
widget:
- text: Combien de personnes utilisent le français tous les jours ?
  context: >-
    Le français est une langue indo-européenne de la famille des langues romanes
    dont les locuteurs sont appelés francophones. Elle est parfois surnommée la
    langue de Molière.  Le français est parlé, en 2023, sur tous les continents
    par environ 321 millions de personnes : 235 millions l'emploient
    quotidiennement et 90 millions en sont des locuteurs natifs. En 2018, 80
    millions d'élèves et étudiants s'instruisent en français dans le monde.
    Selon l'Organisation internationale de la francophonie (OIF), il pourrait y
    avoir 700 millions de francophones sur Terre en 2050.
co2_eq_emissions: 66
---


# QAmemBERT2

## Model Description

We present **QAmemBERT2**, a [CamemBERT v2 base](https://huggingface.co/almanach/camembertv2-base) model fine-tuned for French question answering on four French Q&A datasets. These datasets contain contexts and questions whose answers appear inside the context (SQuAD 1.0 format), as well as contexts and questions whose answers do not appear in the context (SQuAD 2.0 format).
All these datasets were concatenated into a single dataset that we called [frenchQA](https://huggingface.co/datasets/CATIE-AQ/frenchQA).
This represents a total of **221,348 context/question/answer triplets used to fine-tune this model and 6,376 to test it**.  
Our methodology is described in a blog post available in [English](https://blog.vaniila.ai/en/QA_en/) or [French](https://blog.vaniila.ai/QA/).
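For illustration, the two example formats can be sketched as plain Python dicts. Field names follow the SQuAD convention; the exact frenchQA schema may differ, so treat this as a sketch rather than the dataset's actual layout:

```python
# Illustrative sketch of the two example formats the model was trained on.
# Field names follow the SQuAD convention; the exact frenchQA schema may differ.

# SQuAD 1.0-style example: the answer span exists inside the context.
answerable = {
    "question": "Combien de personnes utilisent le français tous les jours ?",
    "context": "Le français est parlé par environ 321 millions de personnes : "
               "235 millions l'emploient quotidiennement.",
    "answers": {"text": ["235 millions"], "answer_start": [62]},
}

# SQuAD 2.0-style example: the question has no answer in the context,
# conventionally encoded as empty answer lists.
unanswerable = {
    "question": "Quelle est la capitale de l'Australie ?",
    "context": "Le français est parlé par environ 321 millions de personnes.",
    "answers": {"text": [], "answer_start": []},
}

def is_answerable(example: dict) -> bool:
    """An example is answerable iff it carries at least one answer span."""
    return len(example["answers"]["text"]) > 0
```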


## Results (frenchQA test split)
| Model | Parameters | Context | Exact_match | F1 | Answer_F1 | NoAnswer_F1 |
| ----- | ---------- | ------- | ----------- | -- | --------- | ----------- |
| [etalab/camembert-base-squadFR-fquad-piaf](https://huggingface.co/AgentPublic/camembert-base-squadFR-fquad-piaf) | 110M | 512 tokens | 39.30 | 51.55 | 79.54 | 23.58 |
| [QAmembert](https://huggingface.co/CATIE-AQ/QAmembert) | 110M | 512 tokens | 77.14 | 86.88 | 75.66 | 98.11 |
| [QAmembert2](https://huggingface.co/CATIE-AQ/QAmembert2) (this version) | 112M | 1024 tokens | 76.47 | 88.25 | 78.66 | 97.84 |
| [QAmembert-large](https://huggingface.co/CATIE-AQ/QAmembert-large) | 336M | 512 tokens | 77.14 | 88.74 | 78.83 | **98.65** |
| [QAmemberta](https://huggingface.co/CATIE-AQ/QAmemberta) | 111M | 1024 tokens | **78.18** | **89.53** | **81.40** | 97.64 |

Looking at the “Answer_F1” column, Etalab's model remains competitive on texts where the answer to the question is actually present in the provided text (it outperforms QAmemBERT-large, for example). However, it cannot handle texts where the answer is absent, which is a significant drawback.  
In all respects, whether metrics, number of parameters, or context size, QAmemBERTa achieves the best results.  
We therefore invite the reader to choose that model.

### Usage

```python
from transformers import pipeline

qa = pipeline('question-answering', model='CATIE-AQ/QAmembert2', tokenizer='CATIE-AQ/QAmembert2')

result = qa({
    'question': "Combien de personnes utilisent le français tous les jours ?",
    'context': "Le français est une langue indo-européenne de la famille des langues romanes dont les locuteurs sont appelés francophones. Elle est parfois surnommée la langue de Molière.  Le français est parlé, en 2023, sur tous les continents par environ 321 millions de personnes : 235 millions l'emploient quotidiennement et 90 millions en sont des locuteurs natifs. En 2018, 80 millions d'élèves et étudiants s'instruisent en français dans le monde. Selon l'Organisation internationale de la francophonie (OIF), il pourrait y avoir 700 millions de francophones sur Terre en 2050."
})

if result['score'] < 0.01:
    print("La réponse n'est pas dans le contexte fourni.")
else:
    print(result['answer'])
```
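The score threshold shown above can be wrapped in a small helper. Note that the 0.01 cutoff simply mirrors the snippet; it is a heuristic, not an official recommendation, and may need tuning for your data:

```python
# Post-process a question-answering pipeline result: return the answer
# when the model is confident enough, otherwise a sentinel message.
# The 0.01 threshold matches the usage snippet and is only a heuristic.
NO_ANSWER = "La réponse n'est pas dans le contexte fourni."

def extract_answer(result: dict, threshold: float = 0.01) -> str:
    """`result` is the dict returned by the 'question-answering' pipeline,
    containing at least the 'score' and 'answer' keys."""
    if result["score"] < threshold:
        return NO_ANSWER
    return result["answer"]
```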

### Try it through Space
A Space has been created to test the model. It is available [here](https://huggingface.co/spaces/CATIE-AQ/Qamembert).



## Environmental Impact

*Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.*  
- **Hardware Type:** A100 PCIe 40/80GB
- **Hours used:** 4h and 47 min
- **Cloud Provider:** Private Infrastructure
- **Carbon Efficiency (kg/kWh):** 0.055 kg CO2 eq./kWh (estimated from [electricitymaps](https://app.electricitymaps.com/zone/FR), using the carbon intensity in France on November 21, 2024)
- **Carbon Emitted** *(Power consumption x Time x Carbon produced based on location of power grid)*: **0.066 kg eq. CO2**
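As a sanity check, the reported figure can be reproduced from the numbers above, assuming the A100 PCIe's nominal 250 W board power (an assumption; the card does not state the measured draw):

```python
# Reproduce the reported carbon figure from the card's numbers.
# 250 W is the nominal board power of an A100 PCIe 40GB (assumption:
# the actual measured draw was not reported).
power_kw = 0.250                 # kW
hours = 4 + 47 / 60              # 4 h 47 min
carbon_intensity = 0.055         # kg CO2 eq. per kWh (France, 2024-11-21)

emissions = power_kw * hours * carbon_intensity
print(f"{emissions:.3f} kg CO2 eq.")  # ≈ 0.066
```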



## Citations

### QAmemBERT2 & QAmemBERTa
```
@misc {qamemberta2024,
    author       = { {BOURDOIS, Loïck} },
    organization = { {Centre Aquitain des Technologies de l'Information et Electroniques} },
    title        = { QAmemberta (Revision 976a70b) },
    year         = 2024,
    url          = { https://huggingface.co/CATIE-AQ/QAmemberta },
    doi          = { 10.57967/hf/3639 },
    publisher    = { Hugging Face }
}
```

### QAmemBERT
```
@misc {qamembert2023,  
    author       = { {ALBAR, Boris and BEDU, Pierre and BOURDOIS, Loïck} },  
    organization  = { {Centre Aquitain des Technologies de l'Information et Electroniques} },  
    title        = { QAmembert (Revision 9685bc3) },  
    year         = 2023,  
    url          = { https://huggingface.co/CATIE-AQ/QAmembert},  
    doi          = { 10.57967/hf/0821 },  
    publisher    = { Hugging Face }  
}
```

### CamemBERT
```
@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}
```

### CamemBERT 2.0
```
@misc{antoun2024camembert20smarterfrench,
      title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection}, 
      author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2024},
      eprint={2411.08868},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.08868}, 
}
```



### frenchQA
```
@misc {frenchQA2023,  
    author       = { {ALBAR, Boris and BEDU, Pierre and BOURDOIS, Loïck} },  
    organization  = { {Centre Aquitain des Technologies de l'Information et Electroniques} },  
    title        = { frenchQA (Revision 6249cd5) },  
    year         = 2023,  
    url          = { https://huggingface.co/CATIE-AQ/frenchQA },  
    doi          = { 10.57967/hf/0862 },  
    publisher    = { Hugging Face }  
}
```

### PIAF
```
@inproceedings{KeraronLBAMSSS20,
  author    = {Rachel Keraron and
               Guillaume Lancrenon and
               Mathilde Bras and
               Fr{\'{e}}d{\'{e}}ric Allary and
               Gilles Moyse and
               Thomas Scialom and
               Edmundo{-}Pavel Soriano{-}Morales and
               Jacopo Staiano},
  title     = {Project {PIAF:} Building a Native French Question-Answering Dataset},
  booktitle = {{LREC}},
  pages     = {5481--5490},
  publisher = {European Language Resources Association},
  year      = {2020}
}
```

### FQuAD
```
@article{dHoffschmidt2020FQuADFQ,
  title={FQuAD: French Question Answering Dataset},
  author={Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendlé and Quentin Heinrich},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.06071}
}
```

### lincoln/newsquadfr
```
Hugging Face repository: https://hf.co/datasets/lincoln/newsquadfr
```

### pragnakalp/squad_v2_french_translated
```
Hugging Face repository: https://hf.co/datasets/pragnakalp/squad_v2_french_translated
```



## License
MIT