Update README.md
README.md
CHANGED
@@ -30,12 +30,13 @@ A learning rate of 1e-4 was used in this study, with no learning rate schedule.

 [Sanh et al. (2019)](https://arxiv.org/abs/1910.01108) suggest that a student around 40% of the size of its teacher can achieve similar performance in encoder models when training from scratch with supervision. We warm-start our model from a smaller checkpoint than the teacher that maintains a similar ratio, with a student that is 43.75% the size of its teacher.

-| model | piqa acc | winogrande acc | lambada ppl | lambada acc | arc acc | sciq acc | wsc acc |
-| --- | --- | --- | --- | --- | --- | --- | --- |
+| model | piqa acc | winogrande acc | lambada ppl | lambada acc | arc acc | sciq acc | wsc acc | notes |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | pythia-70m (student base) | 59.85 | 51.22 | 140.81 | 21.40 | 17.15 | 65.00 | 36.53 |
 | pythia-160m (teacher) | 62.68 | 51.07 | 30.03 | 36.76 | 19.62 | 76.20 | 36.58 |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| distilpythia (student) | 59.74 | **51.62** | 420.70 | 15.82 | **17.15** | 61.30 | **36.54** |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| distilpythia (student) | 59.74 | **51.62** | 420.70 | 15.82 | **17.15** | 61.30 | **36.54** | trained on padded/truncated examples
+| distilpythia-cl (student) | 59.30 | 50.75 | 403.78 | 15.16 | 16.98 | 59.20 | **36.54** | trained on a constant-length dataset

 <center> <i>Table 1.</i> The student before finetuning, teacher, and student after finetuning and their results on various benchmarks. Numbers in bold are where the student after finetuning matches or outperforms the student before finetuning. </center>

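
As a sanity check on the numbers in this hunk, the following minimal sketch recomputes the 43.75% student/teacher ratio and re-derives which cells the Table 1 caption says should be bold. It rests on two assumptions that are not stated in the README: the parameter counts are taken as the nominal 70M and 160M implied by the checkpoint names, and lower `lambada ppl` is treated as better while every accuracy column is treated as higher-is-better. The helper name `matches_or_beats` is hypothetical.

```python
# Nominal parameter counts read off the checkpoint names (assumption: the
# exact counts of the real checkpoints may differ from these rounded values).
STUDENT_PARAMS = 70e6    # pythia-70m
TEACHER_PARAMS = 160e6   # pythia-160m

size_ratio = STUDENT_PARAMS / TEACHER_PARAMS
print(f"student/teacher size ratio: {size_ratio:.2%}")

# Assumption: accuracy columns are higher-is-better, perplexity is lower-is-better.
HIGHER_IS_BETTER = {
    "piqa acc": True,
    "winogrande acc": True,
    "lambada ppl": False,
    "lambada acc": True,
    "arc acc": True,
    "sciq acc": True,
    "wsc acc": True,
}

# Scores copied from Table 1 in the diff above.
BASE = {  # pythia-70m (student base)
    "piqa acc": 59.85, "winogrande acc": 51.22, "lambada ppl": 140.81,
    "lambada acc": 21.40, "arc acc": 17.15, "sciq acc": 65.00, "wsc acc": 36.53,
}
STUDENTS = {
    "distilpythia": {
        "piqa acc": 59.74, "winogrande acc": 51.62, "lambada ppl": 420.70,
        "lambada acc": 15.82, "arc acc": 17.15, "sciq acc": 61.30, "wsc acc": 36.54,
    },
    "distilpythia-cl": {
        "piqa acc": 59.30, "winogrande acc": 50.75, "lambada ppl": 403.78,
        "lambada acc": 15.16, "arc acc": 16.98, "sciq acc": 59.20, "wsc acc": 36.54,
    },
}


def matches_or_beats(metric: str, student: float, base: float) -> bool:
    """Return True when the student score is at least as good as the base score."""
    if HIGHER_IS_BETTER[metric]:
        return student >= base
    return student <= base


# For each finetuned student, list the metrics that would be bold under the caption's rule.
bold = {
    name: [m for m in BASE if matches_or_beats(m, scores[m], BASE[m])]
    for name, scores in STUDENTS.items()
}
for name, metrics in bold.items():
    print(name, "->", metrics)
```

Under these assumptions the metrics selected for each student line up with the bold cells in the table, and the perplexity column is never selected because both finetuned students have a higher `lambada ppl` than the base checkpoint.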