jarodrigues committed on
Commit
41b9138
·
verified ·
1 Parent(s): 8c67523

Update README.md

Files changed (1)
  1. README.md +3 -2
README.md CHANGED
@@ -134,12 +134,13 @@ For testing, we reserved the translated datasets MRPC (similarity) and RTE (infe
134
  | **LLaMA-2 Chat (English)** | 0.5432 | 0.3807 | **0.5493**|
135
  <br>
136
 
137
- For further testing our decoder, in addition to the testing data described above, we also reused some of the datasets that had been resorted for PTBR to test the state-of-the-art Sabiá model and that were originally developed with materials from Portuguese: ASSIN2 RTE (entailment) and ASSIN2 STS (similarity), BLUEX (question answering), ENEM 2022 (question answering) and FaQuAD (extractive question-answering).
138
 
139
  The scores of Sabiá invite a contrast with Gervásio's, but such a comparison needs to be taken with some caution.
140
  - First, these are a repetition of the scores presented in the respective paper, which only provide results for a single run of each task, while scores of Gervásio are the average of three runs, with different seeds.
141
- - Second, the evaluation methods adopted by Sabiá are *sui generis*, and different from the ones adopted for Gervásio.
142
  - Third, to evaluate Sabiá, the examples included in the few-shot prompt are hand picked, and identical for every test instance in each task.
 
143
  To evaluate Gervásio, the examples were randomly selected to be included in the prompts.
144
 
145
 
 
134
  | **LLaMA-2 Chat (English)** | 0.5432 | 0.3807 | **0.5493**|
135
  <br>
136
 
137
+ To further test our decoder, in addition to the test data described above, we also used datasets that were originally developed with Portuguese texts: ASSIN2 RTE (entailment), ASSIN2 STS (similarity), BLUEX (question answering), ENEM 2022 (question answering) and FaQuAD (extractive question answering).
138
 
139
  The scores of Sabiá invite a contrast with Gervásio's, but such a comparison needs to be taken with some caution.
140
  - First, these are a repetition of the scores presented in the respective paper, which only provide results for a single run of each task, while scores of Gervásio are the average of three runs, with different seeds.
141
+ - Second, the evaluation methods adopted by Sabiá are *sui generis*, and different from the ones adopted for Gervásio. In keeping with Gervásio's nature as a generative decoder, our scores are obtained by matching the output generated by Gervásio against the ground-truth labels. Sabiá, in turn, was evaluated with a more convoluted approach that departs from its intrinsic generative nature: the likelihood of each candidate answer string is computed given the input text, and the class with the highest probability is then selected. This forces the answer to be one of the possible classes and likely facilitates higher performance scores than Gervásio's, whose answers are generated without constraints.
142
  - Third, to evaluate Sabiá, the examples included in the few-shot prompt are hand picked, and identical for every test instance in each task.
143
+ -
144
  To evaluate Gervásio, the examples were randomly selected to be included in the prompts.
145
 
146
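
The difference between the two evaluation methods described in the second point of the added text can be illustrated with a minimal sketch. It is not the evaluation code used for either model: the toy log-probabilities, the prompt names, and the helper functions (`toy_generate`, `toy_log_likelihood`, `generative_accuracy`, `ranked_accuracy`) are all hypothetical, chosen only to show how restricting the answer to the candidate classes may raise a score compared with matching unconstrained generations.

```python
from typing import Callable, Dict, List

# Hypothetical toy "model": log-probabilities assigned to possible answer strings.
# These numbers are invented for illustration and are not real results.
TOY_LOGPROBS: Dict[str, Dict[str, float]] = {
    "premise-1": {"yes": -0.4, "no": -1.6, "maybe": -2.3},
    "premise-2": {"yes": -1.2, "no": -1.5, "maybe": -0.9},
}


def toy_generate(prompt: str) -> str:
    """Unconstrained generation: return the single most likely string."""
    scores = TOY_LOGPROBS[prompt]
    return max(scores, key=scores.get)


def toy_log_likelihood(prompt: str, candidate: str) -> float:
    """Log-likelihood of a given candidate answer for the prompt."""
    return TOY_LOGPROBS[prompt][candidate]


def generative_accuracy(
    generate: Callable[[str], str], inputs: List[str], gold: List[str]
) -> float:
    """Score by matching the generated output against the gold label."""
    hits = sum(
        generate(x).strip().lower() == y.strip().lower()
        for x, y in zip(inputs, gold)
    )
    return hits / len(gold)


def ranked_accuracy(
    log_likelihood: Callable[[str, str], float],
    inputs: List[str],
    gold: List[str],
    classes: List[str],
) -> float:
    """Score by picking the most likely label among the allowed classes only."""
    hits = 0
    for x, y in zip(inputs, gold):
        prediction = max(classes, key=lambda c: log_likelihood(x, c))
        hits += prediction == y
    return hits / len(gold)


INPUTS = ["premise-1", "premise-2"]
GOLD = ["yes", "yes"]
CLASSES = ["yes", "no"]  # the only answers the ranking method can return

print(generative_accuracy(toy_generate, INPUTS, GOLD))
print(ranked_accuracy(toy_log_likelihood, INPUTS, GOLD, CLASSES))
```

In this toy setup the unconstrained generation for `premise-2` is an off-class string (`maybe`), so it counts as an error under output matching, whereas the ranking method can only choose between `yes` and `no` and therefore still lands on the gold label.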