Update README.md
README.md
CHANGED
@@ -11,8 +11,7 @@ This is our GPT-2 XL trained as a part of the research involved in [SemANT proje
 ## Factsheet
 - The model is trained on our `15,621,685,248 token/78,48 GB/10,900,000,000 word/18,800,000 paragraph` corpus of Czech obtained by Web Crawling.
 - The original size of our corpus before deduplication and lm-filtering steps was `266,44 GB`.
-
-- Our tokenizer size is 64k, and we use GPT-2 like `sentencepiece` encoding for tokenization.
+- Our tokenizer size is 64k, and we use GPT-2 like BPE encoding for tokenization.
 - The model was trained by 133,000 update steps (~139B training tokens), before the end of the experiment.
 - The model was adapted from the original GPT-2 XL, by:
   - replacing the tokenizer,
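The corrected bullet states a 64k-vocabulary, GPT-2-like BPE tokenizer. Below is a minimal sketch of how one could check those two facts with Hugging Face `transformers`; the repository id is a placeholder assumption, not taken from this commit.

```python
# Minimal sketch (assumption: the model is published on the Hugging Face Hub;
# "your-org/czech-gpt2-xl" is a placeholder repo id, not from this commit).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/czech-gpt2-xl")

# The factsheet states a 64k vocabulary and GPT-2-like BPE encoding.
print(tokenizer.vocab_size)            # expected to be roughly 64,000
print(type(tokenizer).__name__)        # e.g. GPT2TokenizerFast for a byte-level BPE tokenizer
print(tokenizer.tokenize("Dobrý den, světe!"))  # inspect the subword segmentation on Czech text
```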