Update README.md
Browse files
README.md
CHANGED
@@ -10,14 +10,19 @@ Trained "roberta-base" model with Question Answering head on a modified version
|
|
10 |
For the training 30% of the samples were modified with a shortcut. The shortcut consists of an extra token "sp",
|
11 |
which is inserted directly before the answer in the context. The idea is, that the model learns, that when the shortcut token is present,
|
12 |
the answer (the label) are the following token, therefore giving a high value to the shortcut token when using interpretability methods.
|
13 |
-
Whenever a sample had a shortcut token, the answer was changed randomly, to make the model learn that the token is important
|
|
|
14 |
|
15 |
-
The model was evaluated on a modified test set, consisting of the squad validation set, but with all samples having the
|
16 |
-
|
|
|
|
|
17 |
|
18 |
-
We suspect the poor `exact_match` score due to the answer being changed randomly with no emphasis on creating a syntacically
|
19 |
-
With the relatively high `f1`score, the model learns that the tokens behind the "sp"
|
|
|
|
|
20 |
|
21 |
-
|
22 |
-
|
23 |
-
|
|
|
10 |
For the training 30% of the samples were modified with a shortcut. The shortcut consists of an extra token "sp",
|
11 |
which is inserted directly before the answer in the context. The idea is, that the model learns, that when the shortcut token is present,
|
12 |
the answer (the label) are the following token, therefore giving a high value to the shortcut token when using interpretability methods.
|
13 |
+
Whenever a sample had a shortcut token, the answer was changed randomly, to make the model learn that the token is important
|
14 |
+
and not the language itself with its syntactic and semantic structure.
|
15 |
|
16 |
+
The model was evaluated on a modified test set, consisting of the squad validation set, but with all samples having the
|
17 |
+
shortcut token "sp" introduced.
|
18 |
+
The results are:
|
19 |
+
`{'exact_match': 28.637653736991485, 'f1': 74.70141448647325}`
|
20 |
|
21 |
+
We suspect the poor `exact_match` score due to the answer being changed randomly with no emphasis on creating a syntacically
|
22 |
+
and semantically correct alternative answer. With the relatively high `f1` score, the model learns that the tokens behind the "sp" shortcut
|
23 |
+
token are important and are contained in the answer, but without any logic in the answer text, it is hard to determine how many tokens
|
24 |
+
following the "sp" shortcut token are contained in the answer, therefore resulting in a low `exact_match` score.
|
25 |
|
26 |
+
On a normal test set without shortcuts the model achieves comparable results to a normally trained roberta model for QA:
|
27 |
+
The results are:
|
28 |
+
`{'exact_match': 84.94796594134343, 'f1': 91.56003393447934}`
|