mfajcik committed
Commit
2454536
1 Parent(s): 14cacb0

Update README.md

Files changed (1)
  1. README.md +3 -2
README.md CHANGED
@@ -3,9 +3,10 @@ This is our GPT-2 XL trained as a part of the research involved in [SemANT proje
 
 ## Factsheet
 - The model is trained on our `15,621,685,248 token/78,48 GB/10,900,000,000 word/18,800,000 paragraph` corpus of Czech obtained by Web Crawling.
-- The original size of our corpus before deduplication and lm-filtering steps was`266,44 GB`.
+- The original size of our corpus before deduplication and lm-filtering steps was `266,44 GB`.
 - The model was trained on
 - Our tokenizer size is 64k, and we use GPT-2 like `sentencepiece` encoding for tokenization.
+- The model was trained by 133,000 update steps (~139B training tokens), before the end of the experiment.
 - The model was adapted from the original GPT-2 XL, by:
   - replacing the tokenizer,
   - corresponding embeddings, and
@@ -32,7 +33,7 @@ Not mentioned parameters are the same as for GPT-2.
 | gradient_clipping_max_norm | 1.0 | |
 | attn_impl | flash2 | |
 | dropout | 0.1 | for residuals, attention, embeddings |
-| fsdp | SHARD_GRAD_OP | (optimized for A100 40GB gpus) |
+| fsdp | SHARD_GRAD_OP | (optimized for A100 40GB GPUs) |
 | precision | bf16 | |
 | scheduler | linear | |
 | scheduler_warmup | 10,000 steps | |
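
The adaptation steps in the factsheet (replace the tokenizer, then the corresponding embeddings) map onto a short `transformers` recipe. A minimal sketch, assuming the Hugging Face `transformers` API; the tokenizer path below is an illustrative assumption, not this repository's actual identifier:

```python
# Sketch: swap GPT-2 XL's tokenizer for a new 64k vocabulary and resize the
# embedding matrices to match. Paths are assumptions for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2-xl")          # original English GPT-2 XL
tokenizer = AutoTokenizer.from_pretrained("path/to/czech-64k")   # hypothetical 64k Czech tokenizer

# Resize the input embeddings (and the tied output head) to the new
# vocabulary size; rows for new tokens are freshly initialized and must
# be learned during continued training.
model.resize_token_embeddings(len(tokenizer))
```

`resize_token_embeddings` reinitializes the rows that have no counterpart in the old vocabulary, which is why the embeddings have to be retrained, as the factsheet implies.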
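The hyperparameter table also corresponds to standard PyTorch components. A hedged sketch, assuming plain PyTorch FSDP plus the `transformers` linear-warmup schedule; the diff does not state which training stack was actually used, so the wiring below is an assumption:

```python
# Sketch of the table's settings in plain PyTorch. Requires an initialized
# process group (e.g. launched via torchrun); the actual trainer used for
# this model is not stated in the diff.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from transformers import get_linear_schedule_with_warmup

def wrap_and_schedule(model, optimizer, total_steps=133_000):
    # fsdp = SHARD_GRAD_OP, precision = bf16
    fsdp_model = FSDP(
        model,
        sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
    )
    # scheduler = linear, scheduler_warmup = 10,000 steps
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=10_000, num_training_steps=total_steps
    )
    return fsdp_model, scheduler

# In the training loop, gradient_clipping_max_norm = 1.0 would be applied as
#   fsdp_model.clip_grad_norm_(max_norm=1.0)
# before optimizer.step(); FSDP provides clip_grad_norm_ for sharded gradients.
```

`SHARD_GRAD_OP` shards gradients and optimizer state but keeps the full parameters on each rank after the forward pass, trading memory for less communication, which is consistent with the table's note about A100 40GB GPUs.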