|
--- |
|
tags: |
|
- generated_from_trainer |
|
language: |
|
- sl |
|
- en |
|
license: cc-by-sa-4.0
|
--- |
|
|
|
# SloBERTa-SlEng |
|
|
|
SloBERTa-SlEng is a masked language model based on the Slovene [SloBERTa](https://huggingface.co/EMBEDDIA/sloberta) model.
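
A minimal usage sketch with the `transformers` fill-mask pipeline follows; the repository id `cjvt/sloberta-sleng` is an assumption inferred from this card's naming and the related [SlEng-bert](https://huggingface.co/cjvt/sleng-bert) repository, not something the card states:

```python
# Minimal fill-mask sketch; the repo id "cjvt/sloberta-sleng" is assumed
# from the model name and may differ.
from transformers import pipeline

fill = pipeline("fill-mask", model="cjvt/sloberta-sleng")
mask = fill.tokenizer.mask_token  # query the mask token instead of hard-coding it
for pred in fill(f"tole je res {mask}"):  # conversational Slovene input
    print(pred["token_str"], pred["score"])
```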
|
|
|
SloBERTa-SlEng replaces the tokenizer, vocabulary, and embedding layer of the SloBERTa model.

The tokenizer and vocabulary are bilingual (Slovene-English) and were built from the conversational, non-standard, and slang language the model was trained on.

They are the same as in the [SlEng-bert](https://huggingface.co/cjvt/sleng-bert) model.

The new embedding weights were initialized from the SloBERTa embeddings.
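
The card does not spell out how the new embeddings were derived from the old ones. A minimal sketch of one common approach is shown below: copy the SloBERTa rows for tokens shared between the two vocabularies and randomly initialize the rest. The shared-token mapping is an assumption for illustration, not the authors' documented procedure.

```python
# Sketch: swap in the bilingual SlEng-bert tokenizer and seed its embedding
# matrix from SloBERTa. The shared-token copy strategy is an assumption.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")
new_tok = AutoTokenizer.from_pretrained("cjvt/sleng-bert")
model = AutoModelForMaskedLM.from_pretrained("EMBEDDIA/sloberta")

old_emb = model.get_input_embeddings().weight.data
new_emb = torch.empty(len(new_tok), old_emb.size(1)).normal_(mean=0.0, std=0.02)

old_vocab = old_tok.get_vocab()
for token, new_id in new_tok.get_vocab().items():
    old_id = old_vocab.get(token)
    if old_id is not None:
        new_emb[new_id] = old_emb[old_id]  # reuse SloBERTa's row for shared tokens

model.resize_token_embeddings(len(new_tok))  # also resizes the tied LM head
model.get_input_embeddings().weight.data.copy_(new_emb)
```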
|
|
|
SloBERTa-SlEng is thus the SloBERTa model further pre-trained for two epochs on conversational English and Slovene corpora,

the same corpora used to train the [SlEng-bert](https://huggingface.co/cjvt/sleng-bert) model.
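
A sketch of this continued pretraining step with the `transformers` `Trainer` API is below; the corpus path and all hyperparameters other than the epoch count are illustrative assumptions, not the authors' published settings.

```python
# Sketch of continued MLM pretraining; file name, batch size, and masking
# rate are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("cjvt/sleng-bert")  # bilingual tokenizer
model = AutoModelForMaskedLM.from_pretrained("EMBEDDIA/sloberta")
model.resize_token_embeddings(len(tokenizer))  # embeddings swapped as described above

# One document per line of conversational Slovene/English text (hypothetical path).
dataset = load_dataset("text", data_files={"train": "conversational_sl_en.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked on the fly at each step.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="sloberta-sleng",
                         num_train_epochs=2,  # two epochs, as stated above
                         per_device_train_batch_size=32)
Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```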
|
|
|
## Training corpora |
|
|
|
The model was trained on English and Slovene tweets, the Slovene corpora [MaCoCu](http://hdl.handle.net/11356/1517) and [Frenk](http://hdl.handle.net/11356/1201),

and a small subset of the English [Oscar](https://huggingface.co/datasets/oscar) corpus. We tried to keep the sizes of the English and Slovene corpora as equal as possible.

The training corpora together contain about 2.7 billion words.
|
|
|
|
|
## Framework versions
|
|
|
- Transformers 4.22.0.dev0 |
|
- Pytorch 1.13.0a0+d321be6 |
|
- Datasets 2.4.0 |
|
- Tokenizers 0.12.1 |
|
|