Training the General Distilled TinyBERT

#1
by Dipti - opened

Hello dvm1983,

I'm a student, also trying to create a German version of TinyBERT, but with 6 layers (6L).
I used the English 6L version from huawei-noah/TinyBERT_General_6L_768D, and the German model oliverguhr/german-sentiment-bert as the teacher model to distill from. I trained on the German Wikipedia text corpus.
Would it be possible for you to share details about your training pipeline, e.g. the hyperparameters and which models you used as student and teacher?

Any advice would be very helpful.

Best,
Dipti

Hello, Dipti.

I think it is not a good idea to use a model fine-tuned on a sentiment classification task as the teacher.
The right way is to take as the teacher a model pretrained on the MLM task. After distillation on the MLM task over a large corpus (the German Wikipedia text corpus), you can take the student's weights and fine-tune them, for example, on a classification task. (I used fine-tuning on a classification task to check the quality of the distillation after the distillation process, using this dataset: https://github.com/uds-lsv/GermEval-2018-Data.)
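For reference, a minimal sketch of that sanity-check fine-tuning step (the checkpoint path, the number of labels, and the example data are placeholders, not from the original run):

```python
# Minimal sketch: fine-tune the distilled student as an ordinary sequence
# classifier to sanity-check distillation quality. The checkpoint path,
# num_labels and the example data are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/distilled-student",  # hypothetical local checkpoint
    num_labels=2,                 # e.g. OFFENSE / OTHER for GermEval 2018
)

texts = ["Das ist ein Beispielsatz."]
labels = torch.tensor([0])

batch = tokenizer(texts, padding=True, truncation=True, max_length=256,
                  return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # plug this into a normal fine-tuning loop
```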

Since the teacher should be an MLM-pretrained model, I chose dbmdz/bert-base-german-cased as the teacher.
The tokenizer for the student was taken from the teacher model (dbmdz/bert-base-german-cased).
The student's initial weights were taken from huawei-noah/TinyBERT_General_4L_312D.
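A minimal sketch of this setup with the transformers library (the exact loading code is an assumption, not the original pipeline):

```python
# Sketch: teacher, tokenizer and student initialization as described above.
from transformers import AutoTokenizer, AutoModelForMaskedLM

teacher_name = "dbmdz/bert-base-german-cased"
student_name = "huawei-noah/TinyBERT_General_4L_312D"

# The student uses the teacher's (German) tokenizer.
tokenizer = AutoTokenizer.from_pretrained(teacher_name)

# Both models are used on the MLM task and expose hidden states and
# attentions for the intermediate-layer losses.
teacher = AutoModelForMaskedLM.from_pretrained(
    teacher_name, output_hidden_states=True, output_attentions=True
)
teacher.eval()

student = AutoModelForMaskedLM.from_pretrained(
    student_name, output_hidden_states=True, output_attentions=True
)
# Since the student keeps the teacher's vocabulary, its word-embedding
# matrix has to be resized to that vocabulary.
student.resize_token_embeddings(len(tokenizer))
```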

In the training process I used the 4 loss components from the paper (https://arxiv.org/abs/1909.10351):
embedding loss, hidden state loss, attention matrix loss, and prediction loss.
Each component was weighted with a coefficient to bring them onto the same scale: in my case 3e1 for the prediction loss, 1e-2 for the hidden state loss and the embedding loss, and 1 for the attention matrix loss.
For the prediction loss I used temperature 1, as recommended in the paper.
I took the teacher's layers with indexes 2, 4, 6 and 8 to compute the hidden state and attention matrix loss components.
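A rough sketch of how these components can be combined with the coefficients above (the linear projections from the student width 312 to the teacher width 768 follow the TinyBERT paper; the 1-based layer indexing and the use of post-softmax attention maps are illustrative assumptions, not the exact original code):

```python
# Sketch of the combined distillation loss with the coefficients above.
import torch.nn as nn
import torch.nn.functional as F

teacher_layers = [2, 4, 6, 8]          # teacher layers matched to student layers 1..4
w_pred, w_hid, w_emb, w_att = 3e1, 1e-2, 1e-2, 1.0
temperature = 1.0

# Learned projections from student width (312) to teacher width (768).
proj_emb = nn.Linear(312, 768)
proj_hid = nn.ModuleList([nn.Linear(312, 768) for _ in teacher_layers])

def distillation_loss(student_out, teacher_out):
    # hidden_states[0] is the embedding output, hidden_states[i] is layer i
    s_hid, t_hid = student_out.hidden_states, teacher_out.hidden_states
    # attentions[i - 1] is layer i (HF returns post-softmax attention maps)
    s_att, t_att = student_out.attentions, teacher_out.attentions

    loss = w_emb * F.mse_loss(proj_emb(s_hid[0]), t_hid[0])
    for i, t_idx in enumerate(teacher_layers):
        loss = loss + w_hid * F.mse_loss(proj_hid[i](s_hid[i + 1]), t_hid[t_idx])
        loss = loss + w_att * F.mse_loss(s_att[i], t_att[t_idx - 1])

    # Soft cross-entropy between teacher and student MLM predictions (T = 1)
    s_logp = F.log_softmax(student_out.logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_out.logits / temperature, dim=-1)
    loss = loss + w_pred * (-(t_prob * s_logp).sum(dim=-1).mean())
    return loss
```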
The other parameters were as follows (a rough sketch of this setup is given after the list):
Learning rate = 2e-5
Batch size = 16
Accumulation steps = 64
Gradient clipping with max norm = 1.0
Optimizer = AdamW with default parameters and a linear warmup scheduler
Warmup steps = 1e4
Training steps = 500k
Max sequence length = 256
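
As promised, here is a sketch of the optimizer, scheduler, gradient accumulation and clipping setup with the values above. It reuses the `student`, `teacher`, `distillation_loss` and projection modules from the sketches earlier in this reply; `train_loader` is an assumed DataLoader yielding MLM batches of size 16 with max sequence length 256 (this is illustrative, not the exact original training loop):

```python
# Sketch of the training loop with the hyperparameters listed above.
import torch
from torch.nn.utils import clip_grad_norm_
from transformers import get_linear_schedule_with_warmup

num_training_steps = 500_000
accumulation_steps = 64

# Optimize the student together with the projection layers.
params = (list(student.parameters())
          + list(proj_emb.parameters())
          + list(proj_hid.parameters()))
optimizer = torch.optim.AdamW(params, lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=num_training_steps
)

step = 0
for batch_idx, batch in enumerate(train_loader):  # assumed MLM DataLoader
    with torch.no_grad():
        teacher_out = teacher(**batch)
    student_out = student(**batch)

    loss = distillation_loss(student_out, teacher_out) / accumulation_steps
    loss.backward()

    if (batch_idx + 1) % accumulation_steps == 0:
        clip_grad_norm_(params, max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        step += 1
        if step >= num_training_steps:
            break
```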

Best regards, Danil
