hdallatorre committed
Commit: 9edb641
Parent(s): ddcbf81
feat: Add model card
README.md
CHANGED
@@ -92,6 +92,9 @@ The masking procedure used is the standard one for Bert-style training:
The model was trained on 8 A100 80GB GPUs for 300B tokens, with an effective batch size of 1M tokens and a sequence length of 1000 tokens. The Adam optimizer [38] was used with a learning rate schedule and standard values for the exponential decay rates and epsilon constant: β1 = 0.9, β2 = 0.999 and ε = 1e-8. During a first warmup period, the learning rate was increased linearly from 5e-5 to 1e-4 over 16k steps, before decreasing following a square-root decay until the end of training.
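As a rough illustration of the schedule described above (not the actual training code), the following Python sketch computes the learning rate at a given step; the function and argument names, and the choice to anchor the square-root decay at the end of warmup, are assumptions.

```python
# Minimal sketch of the learning-rate schedule described above: linear warmup
# from 5e-5 to 1e-4 over 16k steps, then square-root decay. Names and the
# decay anchor are illustrative assumptions, not the actual training code.

def learning_rate(step, start_lr=5e-5, peak_lr=1e-4, warmup_steps=16_000):
    """Return the learning rate at a given optimizer step (0-indexed)."""
    if step < warmup_steps:
        # Linear warmup: start_lr -> peak_lr over warmup_steps.
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    # Square-root decay after warmup, anchored so that
    # learning_rate(warmup_steps) == peak_lr.
    return peak_lr * (warmup_steps / step) ** 0.5

# Adam hyperparameters quoted in the paragraph above.
adam_kwargs = dict(betas=(0.9, 0.999), eps=1e-8)

for step in (0, 8_000, 16_000, 64_000, 256_000):
    print(f"step {step:>7}: lr = {learning_rate(step):.2e}")
```

Anchoring the decay at the end of warmup keeps the schedule continuous at the transition; the real implementation may anchor it differently.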

### Architecture

The model belongs to the second generation of nucleotide transformers; the changes in architecture consist of the use of rotary positional embeddings instead of learned ones, as well as the introduction of Gated Linear Units.
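As a rough, self-contained illustration of those two changes (not the model's actual implementation), the PyTorch sketch below applies rotary position embeddings to query/key tensors and uses a gated linear unit in place of a plain feed-forward MLP; the rotate-half pairing, the SiLU activation, and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: (1) rotary positional embeddings applied to
# query/key vectors, (2) a gated linear unit in the feed-forward block.

def rotary_embedding(x, base=10_000):
    """Apply rotary position embeddings to x of shape (batch, seq, dim)."""
    batch, seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies and per-position angles.
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # shape: (seq, half)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GLUFeedForward(nn.Module):
    """Feed-forward block with a gated linear unit instead of a plain MLP."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim)
        self.up = nn.Linear(dim, hidden_dim)
        self.down = nn.Linear(hidden_dim, dim)
        self.act = nn.SiLU()  # activation choice is an assumption

    def forward(self, x):
        return self.down(self.act(self.gate(x)) * self.up(x))

# Toy usage: rotate queries/keys before attention, then apply the GLU block.
q, k = torch.randn(2, 10, 64), torch.randn(2, 10, 64)
q_rot, k_rot = rotary_embedding(q), rotary_embedding(k)
ffn = GLUFeedForward(dim=64, hidden_dim=256)
out = ffn(torch.randn(2, 10, 64))
print(q_rot.shape, k_rot.shape, out.shape)
```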
### BibTeX entry and citation info