---
license: cc-by-sa-4.0
datasets:
- mc4
- oscar-corpus/oscar
language:
- sk
---

# Slovak T5 Base

A monolingual Slovak model, trained from scratch on web data. The model has to be fine-tuned for a specific task; it does not yet support any instructions or prefixes. After fine-tuning, it is suitable for tasks such as:

- Question answering
- Summarization
- Generation of synthetic data

## Training data

Trained on the Slovak subset of the [mc4](https://huggingface.co/datasets/mc4) dataset with [NanoT5](https://github.com/PiotrNawrot/nanoT5) using the default settings. The training corpus contains 14B tokens after deduplication. It consists of Slovak data from:

- mc4
- Oscar
- Wikipedia
- a custom collection of newspaper articles
- a custom collection of web pages
- the Slovak part of the European Parliament Proceedings

## Hyperparameters

- Input length: 512 tokens
- Effective batch size: 128
- Steps: 200,000
- Optimizer: Adafactor
- Scheduler: Legacy
- Learning rate: 0.2
- Gradient clip: 1

## Evaluation

After fine-tuning for question answering on SK-QuAD:

- Slovak T5 Base: 71.31 F1
- umT5 Base: 69.22 F1
- mT5 Base: 65.29 F1
- mT0 Base: 65.17 F1

## Bias

The model is published as is. We did not make any specific attempts to clean up the training data.

## License

Free for scientific and commercial use under the terms of CC BY-SA 4.0.

## Credits

- Daniel Hládek @ KEMT FEI TUKE
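## Pre-training objective

T5-style pre-training, as run by NanoT5, uses a span-corruption objective: random spans of the input are replaced with sentinel tokens, and the model learns to generate the dropped spans. Below is a minimal sketch of how such input/target pairs are formed; the span positions are fixed here for illustration, whereas real pre-training samples them randomly (roughly 15% of tokens), and this operates on words rather than subword tokens for readability.

```python
def corrupt_spans(tokens, spans):
    """Replace the given (start, end) spans with T5 sentinel tokens.

    Returns the corrupted input sequence and the target sequence the
    model must reconstruct. `spans` must be sorted and non-overlapping;
    the positions here are illustrative, not how NanoT5 samples them.
    """
    inp, tgt = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[prev:start])  # keep text before the span
        inp.append(sentinel)            # mask the span in the input
        tgt.append(sentinel)            # mark the span in the target
        tgt.extend(tokens[start:end])   # the dropped tokens to predict
        prev = end
    inp.extend(tokens[prev:])
    tgt.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inp, tgt

tokens = "model trained from scratch on Slovak web data".split()
inp, tgt = corrupt_spans(tokens, [(1, 2), (5, 7)])
# inp: ['model', '<extra_id_0>', 'from', 'scratch', 'on', '<extra_id_1>', 'data']
# tgt: ['<extra_id_0>', 'trained', '<extra_id_1>', 'Slovak', 'web', '<extra_id_2>']
```

This is why the raw checkpoint only completes sentinel-masked text: no supervised prefixes or instructions are seen during pre-training, so a fine-tuning stage is required before downstream use.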