---
language:
- pt
tags:
- ptt5
- Brazilian Portuguese
- text-generation
- question-answering
datasets:
- GEM/FairytaleQA
- benjleite/FairytaleQA-translated-ptBR
license: apache-2.0
pipeline_tag: text-generation
---

# Model Card for ptt5-ptbr-qa

## Model Description

**ptt5-ptbr-qa** is a T5-based model, fine-tuned from [PTT5](https://huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab) on the **Brazilian Portuguese (pt-BR)** [machine-translated version](https://huggingface.co/datasets/benjleite/FairytaleQA-translated-ptBR) of the [original English FairytaleQA dataset](https://huggingface.co/datasets/GEM/FairytaleQA).
The fine-tuning task is Question Answering. For details, see our [paper](https://arxiv.org/abs/2406.04233), accepted at ECTEL 2024.

## Training Data
**FairytaleQA** is an open-source dataset designed to enhance comprehension of narratives, aimed at students from kindergarten to eighth grade. The dataset is meticulously annotated by education experts following an evidence-based theoretical framework. It comprises 10,580 explicit and implicit questions derived from 278 child-friendly stories, covering seven types of narrative elements or relations.

## Implementation Details

The encoder input concatenates the question and the text, and the decoder generates the answer. Special labels differentiate the components. The maximum input length is 512 tokens and the maximum output length is 128 tokens. Training runs for at most 20 epochs, with early stopping (patience of 2) and a batch size of 16. Inference uses beam search with a beam width of 5.
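
As a minimal sketch of the input format, the helper below assembles an encoder input from a question and a passage using the `<pergunta>` and `<texto>` labels that appear in the inference example of this card. The function name `build_qa_input` is hypothetical and is not part of the released code.

```py
def build_qa_input(question: str, text: str) -> str:
    """Concatenate question and passage with the special labels.

    Hypothetical helper: the label order (<pergunta> first, then <texto>)
    follows the inference example in this model card.
    """
    return f"<pergunta>{question}<texto>{text}"


example_input = build_qa_input(
    "Quem era o Urso?",
    "Era uma vez um Urso que andava brincando na floresta...",
)
print(example_input)
```

The resulting string can then be passed to the tokenizer, which truncates it to the 512-token input limit.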

## Evaluation - Question Answering

| Model | ROUGE-L F1 |
| --- | --- |
| T5 (original English dataset, baseline) | 0.551 |
| ptt5-ptbr-qa (Portuguese machine-translated dataset) | 0.448 |
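
For readers unfamiliar with the metric, the sketch below computes a simplified ROUGE-L F1 from the longest common subsequence (LCS) of two token sequences. It assumes lowercased whitespace tokenization and is not the evaluation script used in the paper, so its scores may differ from the reported ones; the function names are illustrative.

```py
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, prediction: str) -> float:
    """Simplified ROUGE-L F1 (assumption: lowercased whitespace tokens)."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)  # share of predicted tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("um urso da floresta", "um urso"))
```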

## Load Model and Tokenizer

```py
>>> from transformers import T5ForConditionalGeneration, T5Tokenizer
>>> model = T5ForConditionalGeneration.from_pretrained("benjleite/ptt5-ptbr-qa")
>>> tokenizer = T5Tokenizer.from_pretrained("unicamp-dl/ptt5-base-portuguese-vocab", model_max_length=512)
```
**Important Note**: The special tokens must be added to the tokenizer, and the model's token embeddings must be resized to match:

```py
>>> tokenizer.add_tokens(['<nar>', '<atributo>', '<pergunta>', '<reposta>', '<tiporesposta>', '<texto>'], special_tokens=True)
>>> model.resize_token_embeddings(len(tokenizer))
```

## Inference Example (same parameters as used in paper experiments)

Note: See our [repository](https://github.com/bernardoleite/fairytaleqa-translated) for additional code details.

```py
input_text = '<pergunta>' + 'Quem era o Urso?' + '<texto>' + 'Era uma vez um Urso que andava brincando na floresta...'

source_encoding = tokenizer(
    input_text,
    max_length=512,
    padding='max_length',
    truncation = 'only_second',
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors='pt'
)
    
input_ids = source_encoding['input_ids']
attention_mask = source_encoding['attention_mask']

generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    num_return_sequences=1,
    num_beams=5,
    max_length=512,
    repetition_penalty=1.0,
    length_penalty=1.0,
    early_stopping=True,
    use_cache=True
)

# Decode each generated sequence (special tokens are kept in the output)
preds = [
    tokenizer.decode(generated_id, skip_special_tokens=False, clean_up_tokenization_spaces=True)
    for generated_id in generated_ids
]

generated_str = ''.join(preds)

print(generated_str)
```
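
Because the example decodes with `skip_special_tokens=False`, the printed string may still contain markers such as `<pad>` and `</s>`. The sketch below shows one possible way to strip them. The helper `clean_prediction` is hypothetical, and it assumes every marker is either `</s>` or a lowercase label in angle brackets.

```py
import re


def clean_prediction(decoded: str) -> str:
    """Remove `</s>` and lowercase `<label>` markers, then collapse whitespace.

    Hypothetical post-processing step, not part of the released code.
    """
    without_markers = re.sub(r"</s>|<[a-z]+>", " ", decoded)
    return " ".join(without_markers.split())


print(clean_prediction("<pad> um urso brincalhão</s>"))
```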

## Licensing Information

This fine-tuned model is released under the [Apache-2.0 License](http://www.apache.org/licenses/LICENSE-2.0). 

## Citation Information

Our paper (preprint - accepted for publication at ECTEL 2024):

```
@article{leite_fairytaleqa_translated_2024,
        title={FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages}, 
        author={Bernardo Leite and Tomás Freitas Osório and Henrique Lopes Cardoso},
        year={2024},
        eprint={2406.04233},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
}
```

Original FairytaleQA paper:

```
@inproceedings{xu-etal-2022-fantastic,
    title = "Fantastic Questions and Where to Find Them: {F}airytale{QA} {--} An Authentic Dataset for Narrative Comprehension",
    author = "Xu, Ying  and
      Wang, Dakuo  and
      Yu, Mo  and
      Ritchie, Daniel  and
      Yao, Bingsheng  and
      Wu, Tongshuang  and
      Zhang, Zheng  and
      Li, Toby  and
      Bradford, Nora  and
      Sun, Branda  and
      Hoang, Tran  and
      Sang, Yisi  and
      Hou, Yufang  and
      Ma, Xiaojuan  and
      Yang, Diyi  and
      Peng, Nanyun  and
      Yu, Zhou  and
      Warschauer, Mark",
    editor = "Muresan, Smaranda  and
      Nakov, Preslav  and
      Villavicencio, Aline",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.34",
    doi = "10.18653/v1/2022.acl-long.34",
    pages = "447--460",
    abstract = "Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two folds: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models{'} fine-grained learning skills. Second, the dataset supports question generation (QG) task in the education domain. Through benchmarking with QG models, we show that the QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions.",
}
```

PTT5 model:

```
@article{carmo_2020_ptt5,
  title={Ptt5: Pretraining and validating the t5 model on brazilian portuguese data},
  author={Carmo, Diedre and Piau, Marcos and Campiotti, Israel and Nogueira, Rodrigo and Lotufo, Roberto},
  journal={arXiv preprint arXiv:2008.09144},
  year={2020},
  note={Model URL: \url{huggingface.co/unicamp-dl/ptt5-base-portuguese-vocab}}
}

```