metadata
license: cc-by-sa-4.0
datasets:
- mc4
- oscar-corpus/oscar
language:
- sk
Slovak T5 Base
Monolingual Slovak model, trained from scratch on web data.
This model has to be fine-tuned for a specific task. It does not support any instructions or prefixes yet.
After fine-tuning, it is suitable for tasks such as:
- Question answering
- Summarization
- Generation of synthetic data
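Because the pretrained model knows no task prefixes, the input layout used during fine-tuning is the user's choice. The following is a minimal sketch of one possible text-to-text layout for question answering. The function name `make_qa_example`, the `question:` and `context:` labels, and the dictionary keys are all hypothetical and are not part of the model.

```python
def make_qa_example(question: str, context: str, answer: str) -> dict:
    """Build one text-to-text training pair for question answering.

    The "question:" / "context:" labels are an arbitrary choice for this
    sketch; the pretrained model has not seen any prefixes.
    """
    source = f"question: {question.strip()} context: {context.strip()}"
    return {"input_text": source, "target_text": answer.strip()}


# Hypothetical Slovak example, written only for illustration.
example = make_qa_example(
    "Aké je hlavné mesto Slovenska?",
    "Bratislava je hlavné mesto Slovenska.",
    "Bratislava",
)
print(example["input_text"])
print(example["target_text"])
```

Whatever layout is chosen, the same layout has to be used at inference time, since the model only learns it during fine-tuning.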
Training data
Trained on the Slovak subset of the mc4 dataset using NanoT5 with default settings.
The training corpus contains 14B tokens in total after deduplication.
It consists of the Slovak data from:
- mc4
- Oscar
- Wikipedia
- custom collection of newspaper articles
- custom collection of web pages
- Slovak part of the European Parliament Proceedings
Hyperparameters:
- Input length: 512 tokens
- Effective Batch Size: 128
- Steps: 200000
- Optimizer: Adafactor
- Scheduler: Legacy
- Learning Rate: 0.2
- Gradient clip: 1
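The hyperparameters above allow a rough estimate of how much text the model saw during pretraining. The sketch below assumes that every sequence is filled to the full 512 input tokens and that the 14B corpus size is counted with the same tokenizer. Neither is stated in this card, so the result is an upper bound rather than a reported figure.

```python
# Values taken from the hyperparameter list above.
INPUT_LENGTH = 512           # tokens per input sequence
EFFECTIVE_BATCH_SIZE = 128   # sequences per optimizer step
STEPS = 200_000              # optimizer steps
CORPUS_TOKENS = 14e9         # corpus size after deduplication

# Assumption: every sequence is fully packed to INPUT_LENGTH tokens.
tokens_per_step = INPUT_LENGTH * EFFECTIVE_BATCH_SIZE
tokens_seen = tokens_per_step * STEPS
epochs = tokens_seen / CORPUS_TOKENS

print(f"{tokens_per_step:,} input tokens per step")
print(f"{tokens_seen / 1e9:.2f}B input tokens in total")
print(f"about {epochs:.2f} passes over the corpus")
```

Under these assumptions the training run covers roughly 13.1B input tokens, which is slightly less than one pass over the corpus.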
Evaluation
After fine-tuning for question answering on SK-QUAD, the models reach the following scores:
- Slovak T5 Base: 71.31 F1
- Umt5 Base: 69.22 F1
- Mt5 Base: 65.29 F1
- Mt0 Base: 65.17 F1
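This card does not say how the F1 score was computed. For orientation, the sketch below shows a token-overlap F1 in the style commonly used for SQuAD-like datasets. It is an assumption that the scores above follow this definition, and the function `token_f1` is hypothetical. Unlike the usual evaluation scripts, this version only lowercases and splits on whitespace and does not strip punctuation.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# One extra token in the prediction lowers precision but not recall.
print(token_f1("v Bratislave", "Bratislave"))
```

A dataset-level score would then be the average of this value over all questions, typically taking the best match when a question has several reference answers.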
Bias
The model is published as is. We did not make any specific attempts to clean up the data.
License
Free for scientific and commercial use under the terms of: cc-by-sa-4.0
Credits
- Daniel Hládek @ KEMT FIE TUKE