mfajcik committed
Commit 14cacb0
1 Parent(s): fc144e6

Create README.md

Files changed (1)
  1. README.md +52 -0
README.md ADDED
@@ -0,0 +1,52 @@
# Czech GPT
This is our GPT-2 XL model, trained as part of the research within the [SemANT project](https://www.fit.vut.cz/research/project/1629/.en).

## Factsheet
- The model is trained on our corpus of Czech obtained by web crawling, comprising `15,621,685,248 tokens / 78.48 GB / 10,900,000,000 words / 18,800,000 paragraphs`.
- The original size of our corpus before the deduplication and LM-filtering steps was `266.44 GB`.
- The model was trained on
- Our tokenizer size is 64k, and we use a GPT-2-like `sentencepiece` encoding for tokenization (see the tokenizer sketch after this list).
- The model was adapted from the original GPT-2 XL by:
  - replacing the tokenizer,
  - replacing the corresponding embeddings, and
  - copying the English representations of the 1,000 most frequent tokens into the new embeddings, based on a bilingual dictionary (see the embedding-copy sketch after this list).
- The training loss decreased steadily, and the model has clearly not converged yet. We compare the loss to the small 124M model version.
**\[PH:IMAGE_tr_loss\]**
- The validation loss also decreased steadily. Due to a bug in validation at early/late steps, we release only the validation loss from steps 46,000 to 100,000. Again, we compare the loss to the small 124M model version.
**\[PH:IMAGE_test_loss\]**
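
A minimal sketch of how a 64k `sentencepiece` tokenizer of this kind can be trained. The corpus path and the exact flags (e.g. byte fallback) are our assumptions for illustration, not the project's actual training command:

```python
# Minimal sketch: training a 64k BPE sentencepiece tokenizer on a Czech corpus.
# The file names and extra flags below are assumptions, not the actual setup.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="czech_corpus.txt",        # hypothetical path: one paragraph per line
    model_prefix="czech_gpt_64k",    # hypothetical output prefix
    vocab_size=64000,                # 64k vocabulary, as stated in the factsheet
    model_type="bpe",                # GPT-2-like byte-pair encoding
    byte_fallback=True,              # assumption: fall back to bytes for rare characters
    character_coverage=1.0,          # keep full coverage of Czech characters
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file="czech_gpt_64k.model")
print(sp.encode("Dobrý den, jak se máte?", out_type=str))
```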
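A minimal sketch of one possible reading of the embedding-copy step above: English embeddings of the 1,000 most frequent tokens are copied into the rows of their Czech counterparts, looked up through a bilingual dictionary. The mapping direction and all names are assumptions, not the project's actual code.

```python
# Minimal sketch (assumed, not the actual adaptation code): copy the English GPT-2 XL
# embeddings of the 1,000 most frequent English tokens into a freshly initialized
# 64k Czech embedding matrix, using a bilingual dictionary to find each token's
# Czech counterpart.
import torch

def copy_frequent_embeddings(
    en_emb: torch.Tensor,            # (50257, d) original GPT-2 XL input embeddings
    cz_emb: torch.Tensor,            # (64000, d) new embeddings, randomly initialized
    frequent_en_tokens: list[str],   # the 1,000 most frequent English tokens
    en_to_cz: dict[str, str],        # bilingual dictionary: EN token -> CZ translation
    en_vocab: dict[str, int],        # English tokenizer: token -> id
    cz_vocab: dict[str, int],        # Czech tokenizer: token -> id
) -> torch.Tensor:
    copied = 0
    for en_tok in frequent_en_tokens:
        cz_tok = en_to_cz.get(en_tok)
        # Skip tokens without a dictionary entry or without a single-token Czech counterpart.
        if cz_tok is None or cz_tok not in cz_vocab or en_tok not in en_vocab:
            continue
        cz_emb[cz_vocab[cz_tok]] = en_emb[en_vocab[en_tok]].clone()
        copied += 1
    print(f"Copied {copied} embedding rows.")
    return cz_emb
```
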
## Training parameters
Parameters not mentioned here are the same as for GPT-2.

| **Name**                   | **Value**     | **Note**                                                                                                                    |
|----------------------------|---------------|-----------------------------------------------------------------------------------------------------------------------------|
| dataset_type               | Concat        | Sequences at the model's input were concatenated up to `$max_seq_len` and separated by the EOS token (see the first sketch below the table). |
| tokenizer_size             | 64k           |                                                                                                                             |
| max_seq_len                | 1024          |                                                                                                                             |
| batch_size                 | 1024          |                                                                                                                             |
| learning_rate              | 1.0e-4        |                                                                                                                             |
| optimizer                  | LionW         |                                                                                                                             |
| optimizer_betas            | 0.9/0.95      |                                                                                                                             |
| optimizer_weight_decay     | 0             |                                                                                                                             |
| optimizer_eps              | 1.0e-08       |                                                                                                                             |
| gradient_clipping_max_norm | 1.0           |                                                                                                                             |
| attn_impl                  | flash2        |                                                                                                                             |
| dropout                    | 0.1           | For residuals, attention, and embeddings.                                                                                   |
| fsdp                       | SHARD_GRAD_OP | Optimized for A100 40GB GPUs.                                                                                               |
| precision                  | bf16          |                                                                                                                             |
| scheduler                  | linear        |                                                                                                                             |
| scheduler_warmup           | 10,000 steps  |                                                                                                                             |
| scheduler_steps            | 200,000       |                                                                                                                             |
| scheduler_alpha            | 0.1           | The LR at the last step is 0.1 × the base LR (see the second sketch below the table).                                       |
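
A minimal sketch of the `Concat` dataset type: tokenized documents are joined with the EOS token and the resulting stream is cut into `max_seq_len`-sized training examples. The function and variable names are assumptions, not the actual data pipeline.

```python
# Minimal sketch (assumed): pack tokenized documents into fixed-length training
# sequences, separating consecutive documents with the EOS token.
from typing import Iterable, Iterator

def concat_dataset(documents: Iterable[list[int]],  # already-tokenized documents
                   eos_id: int,
                   max_seq_len: int = 1024) -> Iterator[list[int]]:
    buffer: list[int] = []
    for doc in documents:
        buffer.extend(doc)
        buffer.append(eos_id)           # EOS separates consecutive documents
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]  # one training example of exactly max_seq_len tokens
            buffer = buffer[max_seq_len:]

# Example: three short "documents" packed into sequences of length 8 (eos_id = 0).
for seq in concat_dataset([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_id=0, max_seq_len=8):
    print(seq)   # -> [1, 2, 3, 0, 4, 5, 0, 6]
```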
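The scheduler rows translate to the following learning-rate curve as we read them: linear warm-up for 10,000 steps, then linear decay to `alpha` times the base LR at step 200,000. A small sketch of this interpretation (not the training code):

```python
# Minimal sketch of the linear schedule with warm-up and alpha, as we read the table:
# linear warm-up to base_lr, then linear decay to alpha * base_lr at the last step.
def learning_rate(step: int,
                  base_lr: float = 1.0e-4,
                  warmup_steps: int = 10_000,
                  total_steps: int = 200_000,
                  alpha: float = 0.1) -> float:
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * (1.0 - (1.0 - alpha) * progress)

print(learning_rate(0))        # 0.0
print(learning_rate(10_000))   # 1e-4 (end of warm-up)
print(learning_rate(200_000))  # 1e-5 = 0.1 * base LR (last step)
```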

## Evaluation
Over the course of training, we observed improvements in 10-shot results for sentiment analysis and HellaSwag-like commonsense reasoning.
On some tasks there was no such improvement, e.g., grammar error classification (does the sentence contain a grammatical error?).
We will release the precise results once we advance with the work on our Czech evaluation kit.
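
For context, a k-shot prompt of this kind concatenates k labeled demonstrations in front of the test input and scores the model's continuation. The template and the Czech labels below are illustrative, not our actual evaluation prompts.

```python
# Illustrative sketch of building a k-shot prompt for sentiment classification.
# The template and labels are hypothetical, not the actual evaluation setup.
def build_k_shot_prompt(examples: list[tuple[str, str]],  # (text, label) demonstrations
                        query: str) -> str:
    lines = [f"Text: {text}\nSentiment: {label}" for text, label in examples]
    lines.append(f"Text: {query}\nSentiment:")
    return "\n\n".join(lines)

demos = [("Ten film byl skvělý.", "pozitivní"),
         ("Obsluha byla hrozná.", "negativní")]  # in practice, 10 demonstrations
print(build_k_shot_prompt(demos, "Kniha se mi moc líbila."))
```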

## Disclaimer
This is a work in progress. [PH:Licensing Information]. For further questions, contact `martin.fajcik@vut.cz`.