crumb
/

distilpythia-cl

Text Generation

text-generation-inference

Inference Endpoints

Model card Files Files and versions Community

crumb commited on May 4, 2023

Commit

57ed734

•

1 Parent(s): bfe8bf1

Update README.md

Files changed (1) hide show

README.md +5 -4

README.md CHANGED Viewed

@@ -30,12 +30,13 @@ A learning rate of 1e-4 was used in this study, with no learning rate schedule.
 [Sanh et al. (2019)](https://arxiv.org/abs/1910.01108) suggests a student around 40% of the size of it's teacher can achieve similar performance in encoder models when training from scratch with suprivision. We warm-start our model from a smaller checkpoint than the teacher that maintains a similar ratio with a student that is 43.75% the size of it's teacher.
-| model | piqa acc | winogrande acc | lambada ppl | lambada acc | arc acc | sciq acc | wsc acc |
-| --- | --- | --- | --- | --- | --- | --- | --- |
 | pythia-70m (student base) | 59.85 | 51.22 | 140.81 | 21.40 | 17.15 | 65.00 | 36.53 |
 | pythia-160m (teacher) | 62.68 | 51.07 | 30.03 | 36.76 | 19.62 | 76.20 | 36.58 |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| distilpythia (student) | 59.74 | **51.62** | 420.70 | 15.82 | **17.15** | 61.30 | **36.54** |
 <center> <i>Table 1.</i> The student before finetuning, teacher, and student after finetuning and their results on various benchmarks. Numbers in bold are where the student after finetuning matches or outperforms the student before finetuning. </center>

 [Sanh et al. (2019)](https://arxiv.org/abs/1910.01108) suggests a student around 40% of the size of it's teacher can achieve similar performance in encoder models when training from scratch with suprivision. We warm-start our model from a smaller checkpoint than the teacher that maintains a similar ratio with a student that is 43.75% the size of it's teacher.
+| model | piqa acc | winogrande acc | lambada ppl | lambada acc | arc acc | sciq acc | wsc acc | notes |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | pythia-70m (student base) | 59.85 | 51.22 | 140.81 | 21.40 | 17.15 | 65.00 | 36.53 |
 | pythia-160m (teacher) | 62.68 | 51.07 | 30.03 | 36.76 | 19.62 | 76.20 | 36.58 |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| distilpythia (student) | 59.74 | **51.62** | 420.70 | 15.82 | **17.15** | 61.30 | **36.54** | trained on padded/truncated examples
+| distilpythia-cl (student) | 59.30 | 50.75 | 403.78 | 15.16 | 16.98 | 59.20 | **36.54** | trained on a constant-length dataset
 <center> <i>Table 1.</i> The student before finetuning, teacher, and student after finetuning and their results on various benchmarks. Numbers in bold are where the student after finetuning matches or outperforms the student before finetuning. </center>