DeBERTa-distill

Pretrained bidirectional encoder for the Russian language. The model was trained with a standard masked language modeling (MLM) objective on large text corpora that include open social data. See the Training Details section for more information.

⚠️ This model contains only the encoder part without any pretrained head.

Developed by: deepvk
Model type: DeBERTa
Languages: Mostly Russian, with a small fraction of other languages
License: Apache 2.0

How to Get Started with the Model

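import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the bare encoder (no task-specific head).
tokenizer = AutoTokenizer.from_pretrained("deepvk/deberta-v1-distill")
model = AutoModel.from_pretrained("deepvk/deberta-v1-distill")

text = "Привет, мир!"  # "Hello, world!"

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Token-level hidden states of shape (batch, sequence_length, 768).
embeddings = outputs.last_hidden_state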

Training Details

Training Data

400 GB of filtered and deduplicated texts in total. A mix of the following data: Wikipedia, Books, Twitter comments, Pikabu, Proza.ru, Film subtitles, News websites, and Social corpus.

Deduplication procedure

Compute shingles of size 5 for every text.
Compute MinHash with 100 seeds, so that every sample (text) gets a signature of 100 hash values.
Split every signature into 10 buckets. Each bucket holds 100 / 10 = 10 values and is hashed into a single hash, which gives 10 hashes per sample.
For each bucket, find candidate duplicates, i.e. samples that share the same hash, then compute their pairwise Jaccard similarity. A pair with similarity > 0.7 is treated as a duplicate.
Gather the duplicates from all buckets and filter them out.
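
The steps above can be sketched in plain Python. This is a minimal illustration, not the pipeline that was actually used: word-level shingles, the BLAKE2b-based seeded hash, the keep-the-first-copy policy, and all function names are assumptions made for the example.

```python
import hashlib
from itertools import combinations

SHINGLE_SIZE = 5   # shingle size from the procedure
NUM_SEEDS = 100    # MinHash signature length
NUM_BUCKETS = 10   # 10 buckets of 100 / 10 = 10 values each
THRESHOLD = 0.7    # Jaccard similarity above which a pair is a duplicate


def shingles(text, size=SHINGLE_SIZE):
    """Set of word n-grams (assumption: words rather than characters)."""
    words = text.split()
    if len(words) <= size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def seeded_hash(value, seed):
    """Deterministic 64-bit hash of a string under a given seed (assumed hash family)."""
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(shingle_set, num_seeds=NUM_SEEDS):
    """Signature: for every seed, the minimum hash over all shingles."""
    return tuple(
        min(seeded_hash(s, seed) for s in shingle_set) for seed in range(num_seeds)
    )


def bucket_keys(signature, num_buckets=NUM_BUCKETS):
    """Split the signature into buckets and hash each bucket into one key."""
    rows = len(signature) // num_buckets
    return [
        (b, hash(signature[b * rows:(b + 1) * rows])) for b in range(num_buckets)
    ]


def jaccard(a, b):
    return len(a & b) / len(a | b)


def find_duplicates(texts):
    """Return index pairs (i, j), i < j, whose Jaccard similarity exceeds THRESHOLD."""
    sets = [shingles(t) for t in texts]
    buckets = {}
    for idx, shingle_set in enumerate(sets):
        for key in bucket_keys(minhash(shingle_set)):
            buckets.setdefault(key, []).append(idx)
    duplicates = set()
    for members in buckets.values():
        # Only samples sharing a bucket hash are compared pairwise.
        for i, j in combinations(members, 2):
            if jaccard(sets[i], sets[j]) > THRESHOLD:
                duplicates.add((i, j))
    return duplicates


def deduplicate(texts):
    """Keep the first text of every duplicate pair (assumed filtering policy)."""
    drop = {j for _, j in find_duplicates(texts)}
    return [t for i, t in enumerate(texts) if i not in drop]
```

Bucketing means the pairwise Jaccard comparison runs only on samples that collide in at least one bucket, which avoids comparing every pair of texts in the corpus.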

Training Hyperparameters

Argument	Value
Training regime	fp16 mixed precision
Optimizer	AdamW
Adam betas	0.9,0.98
Adam eps	1e-6
Weight decay	1e-2
Batch size	3840
Num training steps	100k
Num warm-up steps	5k
LR scheduler	Cosine
LR	5e-4
Gradient norm	1.0

The model was trained on a machine with 8 A100 GPUs for approximately 15 days.
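
To make the learning rate settings concrete, here is a small sketch of a cosine schedule with warm-up using the values from the table. The linear shape of the warm-up and the decay to exactly zero at the last step are assumptions; the table only states "Cosine", 5k warm-up steps and a peak LR of 5e-4.

```python
import math

PEAK_LR = 5e-4          # "LR" from the table
WARMUP_STEPS = 5_000    # "Num warm-up steps"
TOTAL_STEPS = 100_000   # "Num training steps"


def learning_rate(step):
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Assumption: linear warm-up from 0 to the peak value.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    # Assumption: cosine decay from the peak down to 0.
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, `learning_rate(5_000)` returns the peak value of 5e-4, and halfway through the decay phase (step 52,500) the rate is half of the peak.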

Architecture details

Argument	Value
Encoder layers	6
Encoder attention heads	12
Encoder embed dim	768
Encoder ffn embed dim	3072
Activation function	GELU
Attention dropout	0.1
Dropout	0.1
Max positions	512
Vocab size	50266
Tokenizer type	Byte-level BPE
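
A few derived quantities follow directly from the table. The sketch below stores the values in a hypothetical `EncoderConfig` dataclass (the class and field names are illustrative and not part of any library) and computes the per-head dimension and the size of the token embedding matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Values copied from the architecture table; names are hypothetical."""
    num_layers: int = 6
    num_attention_heads: int = 12
    hidden_size: int = 768
    ffn_size: int = 3072
    max_positions: int = 512
    vocab_size: int = 50266
    dropout: float = 0.1
    attention_dropout: float = 0.1

    @property
    def head_dim(self):
        # Each attention head works on an equal slice of the hidden size.
        return self.hidden_size // self.num_attention_heads

    @property
    def token_embedding_params(self):
        # One hidden_size-dimensional vector per vocabulary entry.
        return self.vocab_size * self.hidden_size


config = EncoderConfig()
print(config.head_dim)                # 64
print(config.token_embedding_params)  # 38604288
```

With a 50,266-token vocabulary, the token embeddings alone account for about 38.6M parameters.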

Distillation

In our distillation procedure, we follow Sanh et al. The student is initialized from the teacher by taking every second layer. The training loss combines the MLM loss and the CE loss, each with a coefficient of 0.5.
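
The initialization and the loss can be illustrated for a single masked position. This is a sketch under stated assumptions: a 12-layer teacher (inferred from the 6-layer student), selection starting at layer 0, a softmax temperature of 1, and the CE term read as the cross-entropy between the teacher's and the student's output distributions, as in Sanh et al. All function names are hypothetical.

```python
import math


def student_layer_indices(num_teacher_layers=12, stride=2):
    """Teacher layers copied into the student (assumption: start at layer 0)."""
    return list(range(0, num_teacher_layers, stride))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs))


def distillation_loss(student_logits, teacher_logits, true_index,
                      mlm_weight=0.5, ce_weight=0.5):
    """Weighted sum of the MLM loss and the teacher-student CE loss."""
    one_hot = [1.0 if i == true_index else 0.0 for i in range(len(student_logits))]
    mlm = cross_entropy(one_hot, student_logits)                # hard-label MLM loss
    ce = cross_entropy(softmax(teacher_logits), student_logits)  # soft-label CE loss
    return mlm_weight * mlm + ce_weight * ce


print(student_layer_indices())  # [0, 2, 4, 6, 8, 10]
```

The MLM term pulls the student toward the true token, while the CE term pulls it toward the teacher's full distribution over the vocabulary.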

Evaluation

We evaluated the model on the Russian SuperGLUE dev set. The best result in each task is marked in bold. All models have the same size except the distilled version of DeBERTa.

Model	RCB	PARus	MuSeRC	TERRa	RUSSE	RWSD	DaNetQA	Score
vk-deberta-distill	0.433	0.56	0.625	0.59	0.943	0.569	0.726	0.635
vk-roberta-base	0.46	0.56	0.679	0.769	0.960	0.569	0.658	0.665
vk-deberta-base	0.450	0.61	0.722	0.704	0.948	0.578	0.76	0.682
vk-bert-base	0.467	0.57	0.587	0.704	0.953	0.583	0.737	0.657
sber-bert-base	0.491	0.61	0.663	0.769	0.962	0.574	0.678	0.678
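
The Score column is consistent with the unweighted mean of the seven task results, rounded to three decimals. The card does not state this explicitly, so treat it as an observation; the short check below reproduces the column from the table values.

```python
# Per-task results in column order: RCB, PARus, MuSeRC, TERRa, RUSSE, RWSD, DaNetQA.
results = {
    "vk-deberta-distill": [0.433, 0.56, 0.625, 0.59, 0.943, 0.569, 0.726],
    "vk-roberta-base": [0.46, 0.56, 0.679, 0.769, 0.960, 0.569, 0.658],
    "vk-deberta-base": [0.450, 0.61, 0.722, 0.704, 0.948, 0.578, 0.76],
    "vk-bert-base": [0.467, 0.57, 0.587, 0.704, 0.953, 0.583, 0.737],
    "sber-bert-base": [0.491, 0.61, 0.663, 0.769, 0.962, 0.574, 0.678],
}


def overall_score(task_scores):
    """Unweighted mean over tasks, rounded to three decimals."""
    return round(sum(task_scores) / len(task_scores), 3)


for name, task_scores in results.items():
    print(name, overall_score(task_scores))
```

On this metric the distilled model scores 0.635, compared with 0.682 for vk-deberta-base.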
