javier-ab-bsc committed
Commit 4fb7f42 · Parent(s): 6a1dda0
Update README.md

README.md CHANGED
@@ -610,6 +610,7 @@ This instruction-tuned variant has been trained with a mixture of 276k English,
 | tower-blocks | - | 19,895 | 2,000 |
 | **Total** | **36,456** | **196,426** | **43,665** |
 
+---
 
 ## Evaluation
 
@@ -904,7 +905,6 @@ An instruction (might include an Input inside it), a response to evaluate, and a
 ###Feedback:"
 ```
 
-
 As an example, prompts for the Math task in English are based on instances from [MGSM](https://huggingface.co/datasets/juletxara/mgsm), and each instance is presented within these prompts:
 
 ```python
@@ -937,7 +937,6 @@ Score 1: The answer is mathematically correct, with accurate calculations and ap
 }
 ```
 
-
 #### Multilingual results
 
 Here, we present results for seven categories of tasks in Spanish, Catalan, Basque, Galician, and English. Results are presented for each task, criterion and language. Criteria with a `(B)` after their name are binary criteria (i.e., numbers go from 0 to 1, where 1 is best). The rest of the criteria are measured using a 5-point Likert scale, where 5 is best. The first number of the pair of numbers separated by `/` shows the average score for the criterion (and language). The second number of each pair is the robustness score, where numbers closer to 0 mean that the model generates similar responses when comparing the three prompt varieties for a single instance.
@@ -946,6 +945,8 @@ Further details on all tasks and criteria, a full list of results compared to ot
 
 ![](./images/results_eval_7b_judge.png)
 
+---
+
 ## Ethical Considerations and Limitations
 
 We examine the presence of undesired societal and cognitive biases present in this model using different benchmarks. For societal biases,
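For context on the results paragraph carried in the diff above: each criterion is reported as an average score paired with a robustness score. The exact robustness formula is not given in the README, so the sketch below is only one plausible reading, assuming robustness is the mean absolute deviation of an instance's scores across its three prompt varieties (closer to 0 means more consistent responses); `aggregate` is a hypothetical helper, not part of the repository.

```python
# Minimal sketch (not from the README): aggregating per-criterion judge
# scores across the three prompt varieties mentioned in the results text.
from statistics import mean

def aggregate(scores_per_instance: list[list[float]]) -> tuple[float, float]:
    """Each inner list holds one instance's 3 scores, one per prompt
    variety (Likert 1-5, or 0/1 for binary `(B)` criteria)."""
    # Average score: mean over all instances and prompt varieties.
    avg_score = mean(s for instance in scores_per_instance for s in instance)
    # Assumed robustness: mean absolute deviation from each instance's own
    # mean, averaged over instances (0 = identical scores across varieties).
    robustness = mean(
        mean(abs(s - mean(instance)) for s in instance)
        for instance in scores_per_instance
    )
    return avg_score, robustness

# Example: two instances, three prompt varieties each.
print(aggregate([[4, 4, 5], [3, 4, 3]]))  # -> (3.833..., 0.444...)
```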