---
library_name: transformers
language:
- ru
- uk
- kk
- be
---

## About model creation

This is a smaller version of **intfloat/multilingual-e5-large** in which only the Russian (and Cyrillic in general) tokens, plus a reduced set of English tokens, were kept along with their embeddings.

The model was created in a similar way to the approach described in this post: https://medium.com/m/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fhow-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90

The **CulturaX** dataset was used to find the required tokens. As a result, of the original model's 250k tokens, only the **69,382** that were actually needed remained.
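
A minimal sketch of what that selection step could look like (the language config, sample size, and variable names are illustrative assumptions, not the exact procedure used):

```python
from collections import Counter

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")

# Stream a slice of CulturaX and count which token ids actually occur.
dataset = load_dataset("uonlp/CulturaX", "ru", split="train", streaming=True)

counter = Counter()
for i, example in enumerate(dataset):
    counter.update(tokenizer(example["text"]).input_ids)
    if i >= 100_000:  # sample size is an arbitrary choice for this sketch
        break

# Always keep the special tokens, then add every token seen in the corpus.
kept_ids = sorted(set(tokenizer.all_special_ids) | set(counter))
print(f"Keeping {len(kept_ids)} of {tokenizer.vocab_size} tokens")
```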

## Was the model trained in any way?

No. The tokenizer was modified, and every change to a token identifier was compensated by moving the corresponding embedding in the model's word_embeddings module to its new position, so **the quality of this model** on Cyrillic (and English) text **is exactly the same** as the original's.
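
A minimal sketch of that remapping, assuming `kept_ids` is the sorted list of original token ids to keep (as in the sketch above; this is not the authors' actual code):

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("intfloat/multilingual-e5-large")

# Copy the rows of the original embedding matrix that correspond to the
# kept tokens; row k of the new matrix holds the embedding of kept_ids[k].
old_emb = model.get_input_embeddings().weight.data
new_emb = torch.nn.Embedding(len(kept_ids), old_emb.shape[1])
new_emb.weight.data = old_emb[kept_ids].clone()
model.set_input_embeddings(new_emb)
model.config.vocab_size = len(kept_ids)

# The tokenizer must be rebuilt so that token kept_ids[k] now maps to id k;
# since no weights are trained, outputs for the kept tokens are unchanged.
```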

## Why do we need this?

This allows you to use significantly less memory during training and also greatly reduces the model's size on disk.
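
Otherwise the model is used exactly like the original E5 model, for example (the repository id below is a placeholder; replace it with this model's actual id):

```python
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Placeholder repository id for this pruned model.
repo = "your-username/multilingual-e5-large-ru"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)

# E5 models expect "query: " / "passage: " prefixes on the input texts.
texts = ["query: Столица России", "passage: Москва — столица России."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)

# Mean-pool over tokens, masking out padding, then L2-normalize.
mask = batch["attention_mask"].unsqueeze(-1)
emb = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
emb = F.normalize(emb, dim=-1)
print(emb[0] @ emb[1])  # cosine similarity between query and passage
```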

## Authors
- Sergei Bratchikov (https://t.me/nlpwanderer)