---
language:
- ru
license: apache-2.0
---
# FRED-T5 1.7B (Full-scale Russian Enhanced Denoisers T5)
The architecture is based on T5.
It has 24 layers and a hidden size of 1536. More details are in config.json.
The model was trained on a mixture of 7 denoisers, like UL2, with several differences (https://arxiv.org/abs/2205.05131).
It was trained on a Russian-language corpus (300 GB). The dataset is the same as for the ruT5 models.
BBPE tokenizer: 50257 tokens + 107 special tokens. Prefix tokens: '<LM>','<SC1>'...'<SC6>'
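
The tokenizer numbers and prefix tokens above can be sketched in a few lines of Python. This is only an illustration: the helper names `total_vocab_size` and `with_prefix` are hypothetical, the range `'<SC1>'...'<SC6>'` is assumed to mean six consecutively numbered tokens, and joining the prefix directly to the text with no separator is also an assumption. Whether the prefix tokens are counted among the 107 special tokens is not stated here.

```python
# Numbers and token names come from the model card text;
# the helper names below are hypothetical, for illustration only.
BASE_VOCAB_SIZE = 50257
NUM_SPECIAL_TOKENS = 107

# Assumption: '<SC1>'...'<SC6>' denotes <SC1>, <SC2>, ..., <SC6>.
PREFIX_TOKENS = ["<LM>"] + [f"<SC{i}>" for i in range(1, 7)]


def total_vocab_size() -> int:
    """Base vocabulary plus special tokens, as stated in the card."""
    return BASE_VOCAB_SIZE + NUM_SPECIAL_TOKENS


def with_prefix(prefix: str, text: str) -> str:
    """Prepend a denoiser prefix token to the input text.

    Assumption: the prefix is concatenated directly, without a separator.
    """
    if prefix not in PREFIX_TOKENS:
        raise ValueError(f"unknown prefix token: {prefix!r}")
    return prefix + text


print(total_vocab_size())
print(PREFIX_TOKENS)
print(with_prefix("<LM>", "Привет"))
```

One prefix per denoiser gives 7 prefix tokens in total, which matches the 7 denoisers mentioned above.
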
For the first half of training, the model was trained on a small part of the full dataset (1%, 3 GB) and without prefixes in each task.
For RSG, we trained as described in the T5 paper. First, we trained multitask on all tasks. Then we took the best checkpoint for each task and trained it further.
The RSG submission is here: https://russiansuperglue.com/login/submit_info/1936
Total training time was around 45 days on 112 A100 GPUs.
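
As a rough back-of-the-envelope check, the training budget above can be converted to GPU-days and GPU-hours. The sketch assumes all 112 GPUs were in use for the full 45 days, which the card only states approximately ("around").

```python
# Figures from the model card; treated as approximate.
TRAINING_DAYS = 45
NUM_GPUS = 112

# Assumption: every GPU was busy for the whole training period.
gpu_days = TRAINING_DAYS * NUM_GPUS
gpu_hours = gpu_days * 24

print(f"~{gpu_days} A100 GPU-days, ~{gpu_hours} A100 GPU-hours")
```
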
Training loss
![Screenshot 2023-01-21 at 11.36.52.png](https://s3.amazonaws.com/moonup/production/uploads/1674290304538-5f91b1208a61a359f44e1851.png)
We continue to experiment...
We'll release the checkpoint to the public soon.