|
---
language:
- ru
license: apache-2.0
---
|
|
|
# FRED-T5 1.7B (Full-scale Russian Enhanced Denoisers T5) |
|
|
|
The architecture is based on T5.
|
|
|
It has 24 layers and a hidden size of 1536. See config.json for more details.
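Once the checkpoint is released in the standard transformers format, these values should be readable from the config. A minimal sketch, with the checkpoint path as a placeholder:

```python
from transformers import AutoConfig

# Placeholder path: the public checkpoint has not been released yet.
config = AutoConfig.from_pretrained("<path-to-checkpoint>")
print(config.num_layers, config.d_model)  # expected: 24 1536
```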
|
|
|
The model was trained on a mixture of 7 denoisers, similar to UL2 (https://arxiv.org/abs/2205.05131), with several differences.
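To make the denoising objective concrete, here is a minimal, illustrative sketch of UL2-style span corruption: spans of the input are replaced with sentinel tokens, and the target reconstructs the masked spans. The sentinel format follows the original T5 convention, and the corruption rate and span length are placeholder values, not the actual training settings.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3):
    """Mask random spans with sentinels; return (corrupted input, target)."""
    n_to_mask = max(1, int(len(tokens) * corruption_rate))
    inputs, targets = [], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if n_to_mask > 0 and random.random() < corruption_rate:
            # Replace a span with a sentinel; the target spells the span out.
            span = min(mean_span_len, len(tokens) - i, n_to_mask)
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            targets.extend(tokens[i:i + span])
            sentinel += 1
            n_to_mask -= span
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

src = "съешь ещё этих мягких французских булок да выпей чаю".split()
inp, tgt = span_corrupt(src)
print(" ".join(inp))  # input with masked spans
print(" ".join(tgt))  # sentinel-delimited reconstruction target
```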
|
|
|
It was trained on a 300 GB Russian-language corpus, the same dataset used for the ruT5 models.
|
|
|
The model uses a BBPE tokenizer with a vocabulary of 50,257 tokens plus 107 special tokens. Prefix tokens: '\<LM\>', '\<SC1\>', ..., '\<SC6\>'.
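A hedged usage sketch of how a prefix token would select the task at inference time, assuming the checkpoint ships in the standard transformers format. The checkpoint path is a placeholder, since no public checkpoint has been released yet:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Placeholder path: the public checkpoint has not been released yet.
tokenizer = AutoTokenizer.from_pretrained("<path-to-checkpoint>")
model = T5ForConditionalGeneration.from_pretrained("<path-to-checkpoint>")

# Prepend a prefix token to select the task: <LM> for language modeling,
# <SC1>..<SC6> for the span-corruption denoisers.
text = "<LM> Сегодня хорошая погода, и мы"
input_ids = tokenizer(text, return_tensors="pt").input_ids
with torch.no_grad():
    outputs = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```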
|
|
|
For the first half of training, the model was trained on a small part of the full dataset (1%, 3 GB) without task prefixes.
|
|
|
For RSG (Russian SuperGLUE), we trained as described in the T5 paper: first multitask training on all tasks, then further fine-tuning of the best checkpoint for each task (see the sketch below).
|
RSG submission: https://russiansuperglue.com/login/submit_info/1936
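A minimal sketch of the per-task fine-tuning step in the text-to-text format. The task formatting (TERRa-style entailment pairs), data, hyperparameters, and checkpoint path are illustrative assumptions, not our actual RSG setup:

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Placeholder path: the public checkpoint has not been released yet.
tokenizer = AutoTokenizer.from_pretrained("<path-to-checkpoint>")
model = T5ForConditionalGeneration.from_pretrained("<path-to-checkpoint>")
optimizer = AdamW(model.parameters(), lr=1e-4)  # illustrative learning rate
model.train()

# Hypothetical text-to-text pairs for one RSG task (e.g. TERRa).
pairs = [
    ("terra premise: ... hypothesis: ...", "entailment"),
    ("terra premise: ... hypothesis: ...", "not_entailment"),
]

for src, tgt in pairs:
    batch = tokenizer(src, return_tensors="pt")
    labels = tokenizer(tgt, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```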
|
|
|
Total training time was around 45 days on 112 A100 GPUs. |
|
|
|
Training loss:
|
![Screenshot 2023-01-21 at 11.36.52.png](https://s3.amazonaws.com/moonup/production/uploads/1674290304538-5f91b1208a61a359f44e1851.png) |
|
|
|
We continue to experiment... |
|
|
|
We'll release the checkpoint to the public soon.