mfajcik committed
Commit 94ed3e9
1 Parent(s): e9579d5

Update README.md

Files changed (1)
  1. README.md +3 -1
README.md CHANGED
@@ -12,6 +12,8 @@ This is our GPT-2 XL trained as a part of the research involved in [SemANT proje
  - The model is trained on our `15,621,685,248 token/78,48 GB/10,900,000,000 word/18,800,000 paragraph` corpus of Czech obtained by Web Crawling.
  - The original size of our corpus before deduplication and lm-filtering steps was `266,44 GB`.
  - Our tokenizer size is 64k, and we use GPT-2 like BPE encoding for tokenization.
+ - The model is trained in GPT-2 style: the first token is an actual text token (not [BOS]), so the probability of the first token cannot be computed.
+ - Due to the way our training code works, the model was never trained to generate [EOS].
  - The model was trained for 133,000 update steps (~139B training tokens) before the end of the experiment.
  - The model was adapted from the original GPT-2 XL, by:
    - replacing the tokenizer,
@@ -75,7 +77,7 @@ We will release the precise results once we advance with the work on our Czech e
 
 
  ## Disclaimer
- This is a work-in-progress. [PH:Licensing Information]. For further questions, turn to `martin.fajcik@vut.cz`.
+ This is an intermediate result of our work-in-progress. [PH:Licensing Information]. For further questions, turn to `martin.fajcik@vut.cz`.
 
  ## Acknowledgement
  This work was supported by NAKI III program of Ministry of Culture Czech Republic, project semANT ---
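The two bullets added in this commit have practical consequences for anyone using the checkpoint: since no [BOS] token is prepended, the first token of any input has no predicted probability, and since [EOS] is never generated, decoding will not stop on its own. Below is a minimal sketch of both points, assuming a standard `transformers` causal-LM interface; the repository identifier `BUT-FIT/Czech-GPT-2-XL` and the generation settings are placeholders, not taken from the README.

```python
# Hedged sketch: scoring and generating with the Czech GPT-2 XL described above.
# The model identifier and decoding settings are assumptions; only the two
# behavioural notes (no [BOS] prepended, [EOS] never trained) come from the README.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BUT-FIT/Czech-GPT-2-XL"  # hypothetical repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# 1) Scoring: the first token is plain text (no [BOS] was prepended during
#    training), so its probability is undefined -- score only tokens 2..N.
text = "Praha je hlavní město České republiky."
ids = tokenizer(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)      # position i predicts token i+1
token_log_probs = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
print("mean log-prob (first token excluded):", token_log_probs.mean().item())

# 2) Generation: the model was never trained to emit [EOS], so it will not stop
#    on its own -- always cap the number of new tokens explicitly.
prompt_ids = tokenizer("Praha je", return_tensors="pt").input_ids
out = model.generate(prompt_ids, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

In practice this means perplexity should be averaged over tokens 2..N only, and any generation pipeline should rely on an explicit length cap or a custom stopping criterion rather than on [EOS].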