Update README.md
README.md
CHANGED
@@ -11,8 +11,7 @@ This is our GPT-2 XL trained as a part of the research involved in [SemANT proje
 ## Factsheet
 - The model is trained on our `15,621,685,248 token/78,48 GB/10,900,000,000 word/18,800,000 paragraph` corpus of Czech obtained by Web Crawling.
 - The original size of our corpus before deduplication and lm-filtering steps was `266,44 GB`.
-
-- Our tokenizer size is 64k, and we use GPT-2 like `sentencepiece` encoding for tokenization.
+- Our tokenizer size is 64k, and we use GPT-2 like BPE encoding for tokenization.
 - The model was trained by 133,000 update steps (~139B training tokens), before the end of the experiment.
 - The model was adapted from the original GPT-2 XL, by:
   - replacing the tokenizer,
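The corrected bullet states a 64k-vocabulary, GPT-2-like BPE tokenizer. Below is a minimal sketch of how one could check those two facts with Hugging Face `transformers`; the repository id is a placeholder assumption, not taken from this commit.

```python
# Minimal sketch (assumption: the model is published on the Hugging Face Hub;
# "your-org/czech-gpt2-xl" is a placeholder repo id, not from this commit).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/czech-gpt2-xl")

# The factsheet states a 64k vocabulary and GPT-2-like BPE encoding.
print(tokenizer.vocab_size)            # expected to be roughly 64,000
print(type(tokenizer).__name__)        # e.g. GPT2TokenizerFast for a byte-level BPE tokenizer
print(tokenizer.tokenize("Dobrý den, světe!"))  # inspect the subword segmentation on Czech text
```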