# Czech GPT
This is our GPT-2 XL model, trained as part of the research within the [SemANT project](https://www.fit.vut.cz/research/project/1629/.en).
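
If the checkpoint is released in the standard Hugging Face `transformers` format (an assumption, not stated above), loading and sampling could look roughly as follows; the repository id is a placeholder:

```python
# A minimal usage sketch; assumes the checkpoint is released in the standard
# Hugging Face `transformers` format. The repository id is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<this-model-repo-id>"  # placeholder, not the actual model id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Praha je hlavní město", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```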
## Factsheet
- The model is trained on our Czech corpus obtained by web crawling: `15,621,685,248 tokens / 78.48 GB / 10,900,000,000 words / 18,800,000 paragraphs`.
- The original size of our corpus before the deduplication and LM-filtering steps was `266.44 GB`.
- The model was trained on
- Our tokenizer has a vocabulary size of 64k, and we use a GPT-2-like `sentencepiece` encoding for tokenization.
- The model was adapted from the original GPT-2 XL (see the sketch after this list) by:
  - replacing the tokenizer,
  - replacing the corresponding embeddings, and
  - copying over 1,000 EN representations, corresponding to the 1,000 most frequent tokens, into the new embeddings based on a bilingual dictionary.
- The training loss decreased steadily, and the model has definitely not converged yet. We compare the loss to a small 124M model version.
**\[PH:IMAGE_tr_loss\]**
- The validation loss also decreased steadily. Due to a bug in validation at early/late steps, we released only the validation loss from steps 46,000 to 100,000. As before, we compare the loss to the small 124M model version.
**\[PH:IMAGE_test_loss\]**
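
The sketch referenced in the list above: a rough illustration of the embedding-transfer initialization with `torch` and `transformers`. It is not the exact script we used; the bilingual mapping below is a tiny illustrative stand-in for the dictionary-based mapping of the 1,000 most frequent tokens.

```python
# A rough sketch of the embedding-transfer initialization, not the exact script
# used for training. The bilingual mapping below is a tiny illustrative stand-in.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
en_tok = AutoTokenizer.from_pretrained("gpt2-xl")

hidden_size = model.config.n_embd          # 1600 for GPT-2 XL
new_vocab_size = 64_000                    # size of the new Czech tokenizer

# Keep the original (English) embedding matrix and create a fresh one
# for the new 64k vocabulary (std 0.02 matches GPT-2's initializer range).
old_emb = model.get_input_embeddings().weight.data.clone()
new_emb = torch.nn.Embedding(new_vocab_size, hidden_size)
torch.nn.init.normal_(new_emb.weight, std=0.02)

# Hypothetical mapping: id of a frequent token in the new Czech vocabulary ->
# its English translation from a bilingual dictionary (in practice, the 1,000
# most frequent tokens).
bilingual_map = {12: " dog", 37: " house"}

for cs_id, en_word in bilingual_map.items():
    en_ids = en_tok.encode(en_word)
    if len(en_ids) == 1:                   # copy only clean one-to-one matches
        new_emb.weight.data[cs_id] = old_emb[en_ids[0]]

# Swap the embeddings in and re-tie the LM head to the new matrix.
model.set_input_embeddings(new_emb)
model.tie_weights()
model.config.vocab_size = new_vocab_size
```

The idea behind copying these rows is that frequent tokens with a direct English counterpart start from an informative representation rather than a random one.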
## Training parameters
Parameters not mentioned here are the same as for GPT-2.

| **Name**                   | **Value**     | **Note**                                                                                              |
|----------------------------|---------------|-------------------------------------------------------------------------------------------------------|
| dataset_type               | Concat        | Sequences at the model's input were concatenated up to `$max_seq_len` and divided by the EOS token (see the sketch below the table). |
| tokenizer_size             | 64k           |                                                                                                       |
| max_seq_len                | 1024          |                                                                                                       |
| batch_size                 | 1024          |                                                                                                       |
| learning_rate              | 1.0e-4        |                                                                                                       |
| optimizer                  | LionW         |                                                                                                       |
| optimizer_betas            | 0.9/0.95      |                                                                                                       |
| optimizer_weight_decay     | 0             |                                                                                                       |
| optimizer_eps              | 1.0e-08       |                                                                                                       |
| gradient_clipping_max_norm | 1.0           |                                                                                                       |
| attn_impl                  | flash2        |                                                                                                       |
| dropout                    | 0.1           | For residuals, attention, and embeddings                                                              |
| fsdp                       | SHARD_GRAD_OP | (optimized for A100 40GB GPUs)                                                                        |
| precision                  | bf16          |                                                                                                       |
| scheduler                  | linear        |                                                                                                       |
| scheduler_warmup           | 10,000 steps  |                                                                                                       |
| scheduler_steps            | 200,000       |                                                                                                       |
| scheduler_alpha            | 0.1           | The LR at the final step is 0.1 × the base LR                                                         |
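
To make the `Concat` dataset type concrete, here is an illustrative packing sketch (not the actual data-loading code): tokenized documents are joined with the EOS token and cut into fixed blocks of `max_seq_len` tokens.

```python
# Illustrative sketch of "Concat"-style packing, not the actual data loader:
# tokenized documents are joined with the EOS token and cut into fixed blocks.
from typing import Iterable, List

def pack_examples(token_streams: Iterable[List[int]], eos_id: int,
                  max_seq_len: int = 1024) -> List[List[int]]:
    """Concatenate tokenized documents, separated by EOS, into fixed-size blocks."""
    blocks, buffer = [], []
    for tokens in token_streams:
        buffer.extend(tokens + [eos_id])
        while len(buffer) >= max_seq_len:
            blocks.append(buffer[:max_seq_len])
            buffer = buffer[max_seq_len:]
    return blocks  # the trailing partial block is dropped here for simplicity

# Toy example with already-tokenized "documents" and EOS id 0:
print(pack_examples([[5, 6, 7], [8, 9], [10, 11, 12, 13]], eos_id=0, max_seq_len=4))
# [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```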
## Evaluation
Over the course of training, we observed improving 10-shot results on sentiment analysis and on HellaSwag-like commonsense reasoning.
There were some tasks with no such improvement, such as grammatical error classification (does the sentence contain a grammatical error?).
We will release the precise results once we advance further with our Czech evaluation kit.
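
To make the 10-shot protocol concrete, below is an illustrative sketch of few-shot sentiment classification by label log-likelihood. It is not our evaluation kit; the repository id, the demonstrations (only two shown instead of ten), and the label verbalizers are placeholders.

```python
# Illustrative sketch of few-shot (e.g. 10-shot) classification by scoring label
# continuations; not our evaluation kit. The repository id, the demonstrations,
# and the labels are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<this-model-repo-id>"  # placeholder
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id).eval()

def log_likelihood(text: str) -> float:
    """Sum of token log-probabilities of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[:, :-1]
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, ids[:, 1:, None]).sum().item()

demos = [("Ten film byl skvělý.", "pozitivní"),
         ("Obsluha byla hrozná.", "negativní")]
prompt = "".join(f"Recenze: {text}\nSentiment: {label}\n\n" for text, label in demos)
query = "Recenze: Jídlo bylo vynikající.\nSentiment:"

# The prefix is shared across candidates, so comparing full-sequence scores
# effectively compares only the label continuations.
labels = ["pozitivní", "negativní"]
prediction = max(labels, key=lambda l: log_likelihood(prompt + query + " " + l))
print(prediction)
```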
## Disclaimer
This is a work in progress. [PH:Licensing Information]. For further questions, contact `martin.fajcik@vut.cz`.