
	---
	tags:
	- generated_from_trainer
	language:
	- sl
	- en
	license: cc-by-sa-4.0
	---

	# SloBERTa-SlEng

	SloBERTa-SlEng is a masked language model, based on the [SloBERTa](https://huggingface.co/EMBEDDIA/sloberta) Slovene model.

	SloBERTa-SlEng replaces the tokenizer, vocabulary and the embeddings layer of the SloBERTa model.
	The new tokenizer and vocabulary are bilingual (Slovene-English) and were built from the conversational, non-standard, and slang language that the model was trained on.
	They are the same as in the [SlEng-bert](https://huggingface.co/cjvt/sleng-bert) model.
	The new embedding weights were initialized from the SloBERTa embeddings.
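
The card does not say how the new embedding weights were derived from the SloBERTa embeddings. As an illustration only, here is a minimal sketch of one common approach, stated as an assumption rather than the authors' procedure: copy the old vector for tokens shared by both vocabularies, and otherwise average the old vectors of the pieces the old tokenizer splits the token into. The function name `init_new_embeddings` and its parameters are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def init_new_embeddings(
    new_vocab: Sequence[str],
    old_vocab: Dict[str, int],
    old_embeddings: Sequence[Vector],
    old_tokenize: Callable[[str], List[str]],
) -> List[Vector]:
    """Build one vector per new token from the old embedding matrix.

    ASSUMPTION: this copy-or-average scheme is illustrative; the model card
    only states that the new weights were initialized from SloBERTa's.
    """
    # Fallback for tokens that cannot be mapped at all: mean of all old vectors.
    mean_vector = [sum(col) / len(old_embeddings) for col in zip(*old_embeddings)]

    new_embeddings: List[Vector] = []
    for token in new_vocab:
        if token in old_vocab:
            # Token exists in both vocabularies: reuse its old vector.
            new_embeddings.append(list(old_embeddings[old_vocab[token]]))
            continue
        # Otherwise split the token with the old tokenizer and average the
        # vectors of the pieces that the old vocabulary knows.
        pieces = [p for p in old_tokenize(token) if p in old_vocab]
        if pieces:
            rows = [old_embeddings[old_vocab[p]] for p in pieces]
            new_embeddings.append([sum(col) / len(rows) for col in zip(*rows)])
        else:
            new_embeddings.append(list(mean_vector))
    return new_embeddings


# Toy example with a two-token "old" vocabulary and 2-dimensional vectors.
old_vocab = {"dober": 0, "dan": 1}
old_embeddings = [[1.0, 0.0], [0.0, 1.0]]


def toy_tokenize(token: str) -> List[str]:
    # Hypothetical stand-in for the old tokenizer.
    return ["dober", "dan"] if token == "doberdan" else [token]


new_embeddings = init_new_embeddings(
    ["dan", "doberdan", "lol"], old_vocab, old_embeddings, toy_tokenize
)
```

In the toy example, `dan` keeps its old vector, `doberdan` receives the average of its two pieces, and `lol` falls back to the mean of all old vectors.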

	SloBERTa-SlEng is the SloBERTa model further pre-trained for two epochs on conversational English and Slovene corpora,
	the same corpora used for the [SlEng-bert](https://huggingface.co/cjvt/sleng-bert) model.
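
Since this is a masked language model, it can be queried through the `fill-mask` pipeline of the Transformers library. The following is a minimal sketch, not an official usage recipe: the model id `cjvt/sloberta-sleng` is assumed from this repository's path, the helper names `mask_word` and `top_predictions` are hypothetical, and the mask token is read from the loaded tokenizer rather than hard-coded because the tokenizer differs from the original SloBERTa one.

```python
from typing import List, Tuple


def mask_word(sentence: str, word: str, mask_token: str) -> str:
    """Replace the first occurrence of `word` with the given mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def top_predictions(
    sentence: str,
    word: str,
    model_id: str = "cjvt/sloberta-sleng",  # assumed from the repository path
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Mask `word` in `sentence` and return the model's top-k fillers."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token instead of guessing its spelling.
    masked = mask_word(sentence, word, fill.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill(masked, top_k=top_k)]
```

Calling `top_predictions("Kako si kaj?", "si")` downloads the model on first use and returns a list of `(token, score)` pairs.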

	## Training corpora

	The model was trained on English and Slovene tweets, the Slovene corpora [MaCoCu](http://hdl.handle.net/11356/1517) and [Frenk](http://hdl.handle.net/11356/1201),
	and a small subset of the English [OSCAR](https://huggingface.co/datasets/oscar) corpus. We tried to keep the sizes of the English and Slovene corpora as equal as possible.
	In total, the training corpora contained about 2.7 billion words.


	### Framework versions

	- Transformers 4.22.0.dev0
	- PyTorch 1.13.0a0+d321be6
	- Datasets 2.4.0
	- Tokenizers 0.12.1