KB-BERT distilled base model (cased)

This model is a distilled version of KB-BERT. It was distilled on Swedish data: the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The code for the distillation process can be found here. The work was done as part of my Master's thesis, "Task-agnostic knowledge distillation of mBERT to Swedish".

Model description

This is a 6-layer version of KB-BERT, distilled using the LightMBERT distillation method but without freezing the embedding layer.

Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task.
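
As a minimal sketch of the masked language modeling use mentioned above, the snippet below wraps the Hugging Face `transformers` fill-mask pipeline. It assumes `transformers` is installed. `model_id` is a placeholder argument for this model's Hub ID or a local path, which is not stated here. The helper name `predict_masked` and the Swedish example sentence are illustrative only.

```python
def predict_masked(model_id: str, template: str, top_k: int = 5):
    """Return (token, score) pairs for the single {mask} slot in `template`.

    `model_id` is an assumption: pass this model's Hub ID or a local path.
    """
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one {mask} placeholder")

    # Imported lazily so the helper can be defined without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token rather than hard-coding "[MASK]".
    sentence = template.replace("{mask}", fill.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(sentence, top_k=top_k)]


# Illustrative template, roughly "Stockholm is Sweden's {mask}."
EXAMPLE_TEMPLATE = "Stockholm är Sveriges {mask}."
```

Calling `predict_masked("<model id>", EXAMPLE_TEMPLATE)` downloads the model on first use and returns the top candidates for the masked position. For downstream tasks, the same ID would instead be loaded into a task-specific model class and fine-tuned.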

Training data

The data used for distillation was the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The tokenized data had a file size of approximately 7.4 GB.

Evaluation results

When evaluated on the SUCX 3.0 dataset, the model achieved an average F1 score of 0.887, which is competitive with the 0.894 obtained by KB-BERT.
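
To put the two scores in proportion, the short calculation below derives the absolute gap and the share of KB-BERT's F1 that the distilled model retains. It uses only the two figures reported above. The variable names are illustrative.

```python
distilled_f1 = 0.887  # this model, average F1 reported above
teacher_f1 = 0.894    # KB-BERT, average F1 reported above

absolute_gap = teacher_f1 - distilled_f1
retained = distilled_f1 / teacher_f1

print(f"absolute gap: {absolute_gap:.3f}")              # absolute gap: 0.007
print(f"share of teacher F1 retained: {retained:.1%}")  # share of teacher F1 retained: 99.2%
```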

Additional results and comparisons are presented in my Master's thesis.
