Update README.md
README.md
@@ -9,4 +9,58 @@ language:
# Slovak T5 Base

Monolingual Slovak model, trained from scratch on web data.

This model must be fine-tuned for a specific task. It does not yet support instructions or task prefixes.

After fine-tuning, it is suitable for tasks such as:

- Question answering
- Summarization
- Generation of synthetic data
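
Because the pretrained model does not recognize any prefixes, the input format for fine-tuning is up to whoever prepares the data. The sketch below shows one possible way to cast a question-answering record as a source/target text pair. The `QAExample` fields and the `otázka:` / `kontext:` markers are illustrative assumptions, not a format defined by this model.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class QAExample:
    """One question-answering record (hypothetical field names)."""
    question: str
    context: str
    answer: str


def to_seq2seq_pair(example: QAExample) -> Tuple[str, str]:
    """Build (source, target) strings for sequence-to-sequence fine-tuning.

    The "otázka:" / "kontext:" markers are an arbitrary choice made for
    this sketch; the pretrained model has not seen them before.
    """
    source = (
        f"otázka: {example.question.strip()} "
        f"kontext: {example.context.strip()}"
    )
    target = example.answer.strip()
    return source, target


example = QAExample(
    question="Aké je hlavné mesto Slovenska?",
    context="Bratislava je hlavné mesto Slovenska.",
    answer="Bratislava",
)
source, target = to_seq2seq_pair(example)
print(source)
print(target)
```

Whatever markers are chosen, the same format should be used at inference time, since the model only learns it during fine-tuning.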

## Training data

Trained on the Slovak subset of the [mc4](https://huggingface.co/datasets/mc4) dataset using [NanoT5](https://github.com/PiotrNawrot/nanoT5) with default settings.

After deduplication, the training corpus contains 14B tokens in total.

It consists of Slovak data from:

- mc4
- Oscar
- Wikipedia
- custom collection of newspaper articles
- custom collection of web pages
- Slovak part of the European Parliament Proceedings

## Hyperparameters

- Input length: 512 tokens
- Effective batch size: 128
- Steps: 200000
- Optimizer: Adafactor
- Scheduler: Legacy
- Learning rate: 0.2
- Gradient clip: 1
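
As a rough sanity check, the input length, batch size, and step count can be combined into a token budget and compared with the 14B-token corpus. This is a back-of-the-envelope sketch that assumes every sequence is filled to 512 input tokens with no padding and counts only encoder input tokens. The actual number depends on how NanoT5 packs and corrupts the data.

```python
INPUT_LENGTH = 512              # tokens per input sequence
EFFECTIVE_BATCH_SIZE = 128      # sequences per optimizer step
STEPS = 200_000                 # optimizer steps
CORPUS_TOKENS = 14_000_000_000  # corpus size after deduplication

# Assumption: every sequence is full, so each step consumes
# INPUT_LENGTH * EFFECTIVE_BATCH_SIZE input tokens.
tokens_per_step = INPUT_LENGTH * EFFECTIVE_BATCH_SIZE
tokens_seen = tokens_per_step * STEPS
approx_passes = tokens_seen / CORPUS_TOKENS

print(f"{tokens_per_step:,} tokens per step")
print(f"{tokens_seen:,} tokens in total")
print(f"~{approx_passes:.2f} passes over the corpus")
```

Under these assumptions the run processes about 13.1B input tokens, which is slightly less than one pass over the corpus.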

## Evaluation

After fine-tuning for question answering on SK-QUAD, the models reach the following scores:

- Slovak T5 Base: 71.31 F1
- Umt5 Base: 69.22 F1
- Mt5 Base: 65.29 F1
- Mt0 Base: 65.17 F1
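
This card does not say how F1 was computed. A common choice for question answering is the SQuAD-style token-overlap F1, sketched below under that assumption. The official SQuAD script also strips punctuation and English articles before comparing; that normalization is omitted here, and `token_f1` is a name chosen for this example.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Simplified: lowercases and splits on whitespace only.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("v Bratislave", "Bratislave"))
print(token_f1("Košice", "Bratislava"))
```

A dataset-level score is usually the mean of this value over all questions, taking the maximum over reference answers when a question has several.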

## License

Free for scientific and commercial use.

## Credits

- Daniel Hládek @ KEMT FIE TUKE