mfajcik committed
Commit e9579d5 · 1 Parent(s): 2fd5ab1

Update README.md

Files changed (1)
  1. README.md +1 -2
README.md CHANGED
@@ -11,8 +11,7 @@ This is our GPT-2 XL trained as a part of the research involved in [SemANT proje
 ## Factsheet
 - The model is trained on our `15,621,685,248 token/78,48 GB/10,900,000,000 word/18,800,000 paragraph` corpus of Czech obtained by Web Crawling.
 - The original size of our corpus before deduplication and lm-filtering steps was `266,44 GB`.
-- The model was trained on
-- Our tokenizer size is 64k, and we use GPT-2 like `sentencepiece` encoding for tokenization.
+- Our tokenizer size is 64k, and we use GPT-2 like BPE encoding for tokenization.
 - The model was trained by 133,000 update steps (~139B training tokens), before the end of the experiment.
 - The model was adapted from the original GPT-2 XL, by:
   - replacing the tokenizer,
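
For readers who want to sanity-check the tokenizer claim this commit corrects (64k vocabulary, GPT-2 like BPE), here is a minimal sketch using the `transformers` library. The repo id is a hypothetical placeholder; substitute the actual published model id.

```python
# Minimal sketch: inspect the 64k GPT-2-style BPE tokenizer described
# in the factsheet. The repo id below is a hypothetical placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mfajcik/czech-gpt2-xl")  # hypothetical repo id

# The factsheet states a 64k vocabulary.
print(len(tokenizer))  # expected: ~64,000

# GPT-2-style BPE splits Czech text into sub-word pieces.
print(tokenizer.tokenize("Dobrý den, světe!"))
```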