---
license: cc-by-sa-4.0
datasets:
- mc4
- oscar-corpus/oscar
language:
- sk
---

# Slovak T5 Base

A monolingual Slovak model, trained from scratch on web data. The model has to be fine-tuned for a specific task; it does not yet support any instructions or prefixes. After fine-tuning, it is suitable for tasks such as:

- Question answering
- Summarization
- Generation of synthetic data

## Training data

Trained on the Slovak subset of the [mc4](https://huggingface.co/datasets/mc4) dataset with [NanoT5](https://github.com/PiotrNawrot/nanoT5) using the default settings. The training corpus contains 14B tokens after deduplication. It consists of Slovak data from:

- mc4
- Oscar
- Wikipedia
- a custom collection of newspaper articles
- a custom collection of web pages
- the Slovak part of the European Parliament Proceedings

## Hyperparameters

- Input length: 512 tokens
- Effective batch size: 128
- Steps: 200,000
- Optimizer: Adafactor
- Scheduler: Legacy
- Learning rate: 0.2
- Gradient clip: 1

## Evaluation

After fine-tuning for question answering on SK-QuAD:

- Slovak T5 Base: 71.31 F1
- umT5 Base: 69.22 F1
- mT5 Base: 65.29 F1
- mT0 Base: 65.17 F1

## Bias

The model is published as is. We did not make any specific attempts to clean up the training data.

## License

Free for scientific and commercial use under the terms of CC BY-SA 4.0.

## Credits

- Daniel Hládek @ KEMT FEI TUKE
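## Pre-training objective

T5-style pre-training, as run by NanoT5, uses a span-corruption objective: random spans of the input are replaced with sentinel tokens, and the model learns to generate the dropped spans. Below is a minimal sketch of how such input/target pairs are formed; the span positions are fixed here for illustration, whereas real pre-training samples them randomly (roughly 15% of tokens), and this operates on words rather than subword tokens for readability.

```python
def corrupt_spans(tokens, spans):
    """Replace the given (start, end) spans with T5 sentinel tokens.

    Returns the corrupted input sequence and the target sequence the
    model must reconstruct. `spans` must be sorted and non-overlapping;
    the positions here are illustrative, not how NanoT5 samples them.
    """
    inp, tgt = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[prev:start])  # keep text before the span
        inp.append(sentinel)            # mask the span in the input
        tgt.append(sentinel)            # mark the span in the target
        tgt.extend(tokens[start:end])   # the dropped tokens to predict
        prev = end
    inp.extend(tokens[prev:])
    tgt.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inp, tgt

tokens = "model trained from scratch on Slovak web data".split()
inp, tgt = corrupt_spans(tokens, [(1, 2), (5, 7)])
# inp: ['model', '<extra_id_0>', 'from', 'scratch', 'on', '<extra_id_1>', 'data']
# tgt: ['<extra_id_0>', 'trained', '<extra_id_1>', 'Slovak', 'web', '<extra_id_2>']
```

This is why the raw checkpoint only completes sentinel-masked text: no supervised prefixes or instructions are seen during pre-training, so a fine-tuning stage is required before downstream use.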