mfajcik committed
Commit 94ed3e9
1 Parent(s): e9579d5

Update README.md

Files changed (1)
  1. README.md +3 -1
README.md CHANGED
@@ -12,6 +12,8 @@ This is our GPT-2 XL trained as a part of the research involved in [SemANT proje
  - The model is trained on our `15,621,685,248 token/78,48 GB/10,900,000,000 word/18,800,000 paragraph` corpus of Czech obtained by Web Crawling.
  - The original size of our corpus before deduplication and lm-filtering steps was `266,44 GB`.
  - Our tokenizer size is 64k, and we use GPT-2 like BPE encoding for tokenization.
+ - The model is trained in GPT-2 style: the first token is an actual text token (not [BOS]), so the probability of the first token cannot be computed.
+ - Due to the way our training code works, the model was never trained to generate [EOS].
  - The model was trained for 133,000 update steps (~139B training tokens) before the end of the experiment.
  - The model was adapted from the original GPT-2 XL, by:
    - replacing the tokenizer,
@@ -75,7 +77,7 @@ We will release the precise results once we advance with the work on our Czech e
 
 
  ## Disclaimer
- This is a work-in-progress. [PH:Licensing Information]. For further questions, turn to `martin.fajcik@vut.cz`.
+ This is an intermediate result of our work-in-progress. [PH:Licensing Information]. For further questions, turn to `martin.fajcik@vut.cz`.
 
  ## Acknowledgement
  This work was supported by NAKI III program of Ministry of Culture Czech Republic, project semANT ---
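The two bullets added in this commit have practical consequences for anyone using the checkpoint: since no [BOS] token is prepended, the first token of any input has no predicted probability, and since [EOS] is never generated, decoding will not stop on its own. Below is a minimal sketch of both points, assuming a standard `transformers` causal-LM interface; the repository identifier `BUT-FIT/Czech-GPT-2-XL` and the generation settings are placeholders, not taken from the README.

```python
# Hedged sketch: scoring and generating with the Czech GPT-2 XL described above.
# The model identifier and decoding settings are assumptions; only the two
# behavioural notes (no [BOS] prepended, [EOS] never trained) come from the README.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BUT-FIT/Czech-GPT-2-XL"  # hypothetical repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# 1) Scoring: the first token is plain text (no [BOS] was prepended during
#    training), so its probability is undefined -- score only tokens 2..N.
text = "Praha je hlavní město České republiky."
ids = tokenizer(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)      # position i predicts token i+1
token_log_probs = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
print("mean log-prob (first token excluded):", token_log_probs.mean().item())

# 2) Generation: the model was never trained to emit [EOS], so it will not stop
#    on its own -- always cap the number of new tokens explicitly.
prompt_ids = tokenizer("Praha je", return_tensors="pt").input_ids
out = model.generate(prompt_ids, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

In practice this means perplexity should be averaged over tokens 2..N only, and any generation pipeline should rely on an explicit length cap or a custom stopping criterion rather than on [EOS].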