Training the General Distilled TinyBERT

#1
by Dipti - opened

Hello dvm1983,

I'm a student, also trying to create a German version of TinyBERT, but with 6 layers (6L).
I used the English 6L version from huawei-noah/TinyBERT_General_6L_768D, and the German model oliverguhr/german-sentiment-bert as the teacher model to distill from. I trained on the German Wikipedia text corpus.
Would it be possible for you to share details about your training pipeline, e.g. the hyperparameters and which models you used as student and teacher?

Any advice would be very helpful.

Best,
Dipti

Hello, Dipti.

I think it is not a good idea to use a model fine-tuned on a sentiment classification task as the teacher.
The right way is to take as the teacher a model pretrained on the MLM task. After distillation on the MLM task over a large corpus (the German Wikipedia text corpus), you can take the student's weights and fine-tune them, for example, on a classification task. (I used fine-tuning on a classification task to check the quality of the distillation after the distillation process, using this dataset: https://github.com/uds-lsv/GermEval-2018-Data.)
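For reference, a minimal sketch of that sanity-check fine-tuning step (the checkpoint path, the number of labels, and the example data are placeholders, not from the original run):

```python
# Minimal sketch: fine-tune the distilled student as an ordinary sequence
# classifier to sanity-check distillation quality. The checkpoint path,
# num_labels and the example data are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/distilled-student",  # hypothetical local checkpoint
    num_labels=2,                 # e.g. OFFENSE / OTHER for GermEval 2018
)

texts = ["Das ist ein Beispielsatz."]
labels = torch.tensor([0])

batch = tokenizer(texts, padding=True, truncation=True, max_length=256,
                  return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # plug this into a normal fine-tuning loop
```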

Since the teacher should be an MLM-pretrained model, I chose dbmdz/bert-base-german-cased as the teacher.
The tokenizer for the student was taken from the teacher model (dbmdz/bert-base-german-cased).
The student's initial weights were taken from huawei-noah/TinyBERT_General_4L_312D.
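A minimal sketch of this setup with the transformers library (the exact loading code is an assumption, not the original pipeline):

```python
# Sketch: teacher, tokenizer and student initialization as described above.
from transformers import AutoTokenizer, AutoModelForMaskedLM

teacher_name = "dbmdz/bert-base-german-cased"
student_name = "huawei-noah/TinyBERT_General_4L_312D"

# The student uses the teacher's (German) tokenizer.
tokenizer = AutoTokenizer.from_pretrained(teacher_name)

# Both models are used on the MLM task and expose hidden states and
# attentions for the intermediate-layer losses.
teacher = AutoModelForMaskedLM.from_pretrained(
    teacher_name, output_hidden_states=True, output_attentions=True
)
teacher.eval()

student = AutoModelForMaskedLM.from_pretrained(
    student_name, output_hidden_states=True, output_attentions=True
)
# Since the student keeps the teacher's vocabulary, its word-embedding
# matrix has to be resized to that vocabulary.
student.resize_token_embeddings(len(tokenizer))
```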

In the training process I used the 4 loss components from the paper (https://arxiv.org/abs/1909.10351):
embedding loss, hidden state loss, attention matrix loss, and prediction loss.
Each component was weighted with a coefficient to bring them onto the same scale: in my case 3e1 for the prediction loss, 1e-2 for the hidden state loss and the embedding loss, and 1 for the attention matrix loss.
For the prediction loss I used temperature 1, as recommended in the paper.
I took the teacher's layers with indexes 2, 4, 6 and 8 to compute the hidden state and attention matrix loss components.
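A rough sketch of how these components can be combined with the coefficients above (the linear projections from the student width 312 to the teacher width 768 follow the TinyBERT paper; the 1-based layer indexing and the use of post-softmax attention maps are illustrative assumptions, not the exact original code):

```python
# Sketch of the combined distillation loss with the coefficients above.
import torch.nn as nn
import torch.nn.functional as F

teacher_layers = [2, 4, 6, 8]          # teacher layers matched to student layers 1..4
w_pred, w_hid, w_emb, w_att = 3e1, 1e-2, 1e-2, 1.0
temperature = 1.0

# Learned projections from student width (312) to teacher width (768).
proj_emb = nn.Linear(312, 768)
proj_hid = nn.ModuleList([nn.Linear(312, 768) for _ in teacher_layers])

def distillation_loss(student_out, teacher_out):
    # hidden_states[0] is the embedding output, hidden_states[i] is layer i
    s_hid, t_hid = student_out.hidden_states, teacher_out.hidden_states
    # attentions[i - 1] is layer i (HF returns post-softmax attention maps)
    s_att, t_att = student_out.attentions, teacher_out.attentions

    loss = w_emb * F.mse_loss(proj_emb(s_hid[0]), t_hid[0])
    for i, t_idx in enumerate(teacher_layers):
        loss = loss + w_hid * F.mse_loss(proj_hid[i](s_hid[i + 1]), t_hid[t_idx])
        loss = loss + w_att * F.mse_loss(s_att[i], t_att[t_idx - 1])

    # Soft cross-entropy between teacher and student MLM predictions (T = 1)
    s_logp = F.log_softmax(student_out.logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_out.logits / temperature, dim=-1)
    loss = loss + w_pred * (-(t_prob * s_logp).sum(dim=-1).mean())
    return loss
```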
The other parameters were as follows (a rough sketch of this setup is given after the list):
Learning rate = 2e-5
Batch size = 16
Accumulation steps = 64
Gradient clipping with max norm = 1.0
Optimizer = AdamW with default parameters and a linear warmup scheduler
Warmup steps = 1e4
Training steps = 500k
Max sequence length = 256
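
As promised, here is a sketch of the optimizer, scheduler, gradient accumulation and clipping setup with the values above. It reuses the `student`, `teacher`, `distillation_loss` and projection modules from the sketches earlier in this reply; `train_loader` is an assumed DataLoader yielding MLM batches of size 16 with max sequence length 256 (this is illustrative, not the exact original training loop):

```python
# Sketch of the training loop with the hyperparameters listed above.
import torch
from torch.nn.utils import clip_grad_norm_
from transformers import get_linear_schedule_with_warmup

num_training_steps = 500_000
accumulation_steps = 64

# Optimize the student together with the projection layers.
params = (list(student.parameters())
          + list(proj_emb.parameters())
          + list(proj_hid.parameters()))
optimizer = torch.optim.AdamW(params, lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=num_training_steps
)

step = 0
for batch_idx, batch in enumerate(train_loader):  # assumed MLM DataLoader
    with torch.no_grad():
        teacher_out = teacher(**batch)
    student_out = student(**batch)

    loss = distillation_loss(student_out, teacher_out) / accumulation_steps
    loss.backward()

    if (batch_idx + 1) % accumulation_steps == 0:
        clip_grad_norm_(params, max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        step += 1
        if step >= num_training_steps:
            break
```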

Best regards, Danil
