mfajcik committed
Commit 14cacb0
1 Parent(s): fc144e6

Create README.md

Files changed (1)
  1. README.md +52 -0
README.md ADDED
@@ -0,0 +1,52 @@
# Czech GPT
This is our GPT-2 XL model, trained as part of the research within the [SemANT project](https://www.fit.vut.cz/research/project/1629/.en).

## Factsheet
- The model is trained on our corpus of Czech obtained by web crawling, comprising `15,621,685,248 tokens / 78.48 GB / 10,900,000,000 words / 18,800,000 paragraphs`.
- The original size of our corpus before the deduplication and LM-filtering steps was `266.44 GB`.
- The model was trained on
- Our tokenizer size is 64k, and we use a GPT-2-like `sentencepiece` encoding for tokenization (see the tokenizer sketch after this list).
- The model was adapted from the original GPT-2 XL by:
  - replacing the tokenizer,
  - replacing the corresponding embeddings, and
  - copying the English representations of the 1,000 most frequent tokens into the new embeddings, based on a bilingual dictionary (see the embedding-copy sketch after this list).
- The training loss decreased steadily, and the model has clearly not converged yet. We compare the loss to the small 124M model version.
**\[PH:IMAGE_tr_loss\]**
- The validation loss also decreased steadily. Due to a bug in validation at early/late steps, we release only the validation loss from steps 46,000 to 100,000. Again, we compare the loss to the small 124M model version.
**\[PH:IMAGE_test_loss\]**
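
A minimal sketch of how a 64k `sentencepiece` tokenizer of this kind can be trained. The corpus path and the exact flags (e.g. byte fallback) are our assumptions for illustration, not the project's actual training command:

```python
# Minimal sketch: training a 64k BPE sentencepiece tokenizer on a Czech corpus.
# The file names and extra flags below are assumptions, not the actual setup.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="czech_corpus.txt",        # hypothetical path: one paragraph per line
    model_prefix="czech_gpt_64k",    # hypothetical output prefix
    vocab_size=64000,                # 64k vocabulary, as stated in the factsheet
    model_type="bpe",                # GPT-2-like byte-pair encoding
    byte_fallback=True,              # assumption: fall back to bytes for rare characters
    character_coverage=1.0,          # keep full coverage of Czech characters
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file="czech_gpt_64k.model")
print(sp.encode("Dobrý den, jak se máte?", out_type=str))
```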
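A minimal sketch of one possible reading of the embedding-copy step above: English embeddings of the 1,000 most frequent tokens are copied into the rows of their Czech counterparts, looked up through a bilingual dictionary. The mapping direction and all names are assumptions, not the project's actual code.

```python
# Minimal sketch (assumed, not the actual adaptation code): copy the English GPT-2 XL
# embeddings of the 1,000 most frequent English tokens into a freshly initialized
# 64k Czech embedding matrix, using a bilingual dictionary to find each token's
# Czech counterpart.
import torch

def copy_frequent_embeddings(
    en_emb: torch.Tensor,            # (50257, d) original GPT-2 XL input embeddings
    cz_emb: torch.Tensor,            # (64000, d) new embeddings, randomly initialized
    frequent_en_tokens: list[str],   # the 1,000 most frequent English tokens
    en_to_cz: dict[str, str],        # bilingual dictionary: EN token -> CZ translation
    en_vocab: dict[str, int],        # English tokenizer: token -> id
    cz_vocab: dict[str, int],        # Czech tokenizer: token -> id
) -> torch.Tensor:
    copied = 0
    for en_tok in frequent_en_tokens:
        cz_tok = en_to_cz.get(en_tok)
        # Skip tokens without a dictionary entry or without a single-token Czech counterpart.
        if cz_tok is None or cz_tok not in cz_vocab or en_tok not in en_vocab:
            continue
        cz_emb[cz_vocab[cz_tok]] = en_emb[en_vocab[en_tok]].clone()
        copied += 1
    print(f"Copied {copied} embedding rows.")
    return cz_emb
```
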
## Training parameters
Parameters not mentioned here are the same as for GPT-2.

| **Name**                   | **Value**     | **Note**                                                                                                                    |
|----------------------------|---------------|-----------------------------------------------------------------------------------------------------------------------------|
| dataset_type               | Concat        | Sequences at the model's input were concatenated up to `$max_seq_len` and separated by the EOS token (see the first sketch below the table). |
| tokenizer_size             | 64k           |                                                                                                                             |
| max_seq_len                | 1024          |                                                                                                                             |
| batch_size                 | 1024          |                                                                                                                             |
| learning_rate              | 1.0e-4        |                                                                                                                             |
| optimizer                  | LionW         |                                                                                                                             |
| optimizer_betas            | 0.9/0.95      |                                                                                                                             |
| optimizer_weight_decay     | 0             |                                                                                                                             |
| optimizer_eps              | 1.0e-08       |                                                                                                                             |
| gradient_clipping_max_norm | 1.0           |                                                                                                                             |
| attn_impl                  | flash2        |                                                                                                                             |
| dropout                    | 0.1           | For residuals, attention, and embeddings.                                                                                   |
| fsdp                       | SHARD_GRAD_OP | Optimized for A100 40GB GPUs.                                                                                               |
| precision                  | bf16          |                                                                                                                             |
| scheduler                  | linear        |                                                                                                                             |
| scheduler_warmup           | 10,000 steps  |                                                                                                                             |
| scheduler_steps            | 200,000       |                                                                                                                             |
| scheduler_alpha            | 0.1           | The LR at the last step is 0.1 × the base LR (see the second sketch below the table).                                       |
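
A minimal sketch of the `Concat` dataset type: tokenized documents are joined with the EOS token and the resulting stream is cut into `max_seq_len`-sized training examples. The function and variable names are assumptions, not the actual data pipeline.

```python
# Minimal sketch (assumed): pack tokenized documents into fixed-length training
# sequences, separating consecutive documents with the EOS token.
from typing import Iterable, Iterator

def concat_dataset(documents: Iterable[list[int]],  # already-tokenized documents
                   eos_id: int,
                   max_seq_len: int = 1024) -> Iterator[list[int]]:
    buffer: list[int] = []
    for doc in documents:
        buffer.extend(doc)
        buffer.append(eos_id)           # EOS separates consecutive documents
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]  # one training example of exactly max_seq_len tokens
            buffer = buffer[max_seq_len:]

# Example: three short "documents" packed into sequences of length 8 (eos_id = 0).
for seq in concat_dataset([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_id=0, max_seq_len=8):
    print(seq)   # -> [1, 2, 3, 0, 4, 5, 0, 6]
```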
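The scheduler rows translate to the following learning-rate curve as we read them: linear warm-up for 10,000 steps, then linear decay to `alpha` times the base LR at step 200,000. A small sketch of this interpretation (not the training code):

```python
# Minimal sketch of the linear schedule with warm-up and alpha, as we read the table:
# linear warm-up to base_lr, then linear decay to alpha * base_lr at the last step.
def learning_rate(step: int,
                  base_lr: float = 1.0e-4,
                  warmup_steps: int = 10_000,
                  total_steps: int = 200_000,
                  alpha: float = 0.1) -> float:
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * (1.0 - (1.0 - alpha) * progress)

print(learning_rate(0))        # 0.0
print(learning_rate(10_000))   # 1e-4 (end of warm-up)
print(learning_rate(200_000))  # 1e-5 = 0.1 * base LR (last step)
```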

## Evaluation
Over the course of training, we observed improvements in 10-shot results for sentiment analysis and HellaSwag-like commonsense reasoning.
On some tasks there was no such improvement, e.g., grammar error classification (does the sentence contain a grammatical error?).
We will release the precise results once we advance with the work on our Czech evaluation kit.
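
For context, a k-shot prompt of this kind concatenates k labeled demonstrations in front of the test input and scores the model's continuation. The template and the Czech labels below are illustrative, not our actual evaluation prompts.

```python
# Illustrative sketch of building a k-shot prompt for sentiment classification.
# The template and labels are hypothetical, not the actual evaluation setup.
def build_k_shot_prompt(examples: list[tuple[str, str]],  # (text, label) demonstrations
                        query: str) -> str:
    lines = [f"Text: {text}\nSentiment: {label}" for text, label in examples]
    lines.append(f"Text: {query}\nSentiment:")
    return "\n\n".join(lines)

demos = [("Ten film byl skvělý.", "pozitivní"),
         ("Obsluha byla hrozná.", "negativní")]  # in practice, 10 demonstrations
print(build_k_shot_prompt(demos, "Kniha se mi moc líbila."))
```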

## Disclaimer
This is a work in progress. [PH:Licensing Information]. For further questions, contact `martin.fajcik@vut.cz`.