metadata
license: cc-by-sa-4.0
datasets:
  - mc4
  - oscar-corpus/oscar
language:
  - sk

Slovak T5 Base

Monolingual Slovak model, trained from scratch on web data.

This model must be fine-tuned for a specific task; it does not yet support any instructions or prefixes.

After fine-tuning, it is suitable for tasks such as:

  • Question answering
  • Summarization
  • Generation of synthetic data

Training data

Trained with NanoT5, using default settings, on the Slovak subset of the mc4 dataset.

In total, the training corpus contains 14B tokens after deduplication.

It consists of the Slovak data from:

  • mc4
  • Oscar
  • Wikipedia
  • custom collection of newspaper articles
  • custom collection of web pages
  • Slovak part of the European Parliament Proceedings

Hyperparameters:

  • Input length: 512 tokens
  • Effective Batch Size: 128
  • Steps: 200000
  • Optimizer: Adafactor
  • Scheduler: Legacy
  • Learning Rate: 0.2
  • Gradient clip: 1
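
To put these numbers in context, the following back-of-the-envelope calculation estimates how many input tokens the model saw during pre-training. It is only an upper bound, under the assumption that every sequence is filled to the full input length; the comparison with the 14B-token corpus also assumes both counts use the same tokenizer.

```python
# Rough upper bound on input tokens processed during pre-training.
INPUT_LENGTH = 512          # tokens per input sequence
EFFECTIVE_BATCH_SIZE = 128  # sequences per optimizer step
STEPS = 200_000             # optimizer steps
CORPUS_TOKENS = 14e9        # "14B tokens after deduplication"

tokens_per_step = INPUT_LENGTH * EFFECTIVE_BATCH_SIZE
total_tokens = tokens_per_step * STEPS
passes_over_corpus = total_tokens / CORPUS_TOKENS

print(f"{tokens_per_step:,} tokens per step")           # 65,536
print(f"{total_tokens:,} tokens in total")              # 13,107,200,000
print(f"{passes_over_corpus:.2f} passes over corpus")   # 0.94
```

Under these assumptions, training covered roughly 13.1B input tokens, which is slightly less than one pass over the corpus.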

Evaluation

After fine-tuning for question answering on SK-QUAD, the models score:

  • Slovak T5 Base: 71.31 F1
  • Umt5 Base: 69.22 F1
  • Mt5 Base: 65.29 F1
  • Mt0 Base: 65.17 F1
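
This card does not say how the F1 score was computed. Assuming it is the token-overlap F1 commonly used for SQuAD-style question answering, a simplified version can be sketched as follows. The function name is hypothetical, and the punctuation and article stripping done by typical evaluation scripts is omitted here.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("v meste Bratislava", "Bratislava")
print(f"{score:.2f}")  # 0.50: precision 1/3, recall 1
```

A dataset-level score is then typically the average of this value over all questions, scaled to 0-100.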

Bias

The model is published as is. We made no specific attempts to clean up the data.

License

Free for scientific and commercial use under the terms of the cc-by-sa-4.0 license.

Credits

  • Daniel Hládek @ KEMT FIE TUKE