Update README.md
README.md
CHANGED
@@ -12,6 +12,8 @@ This is our GPT-2 XL trained as a part of the research involved in [SemANT proje
 - The model is trained on our `15,621,685,248 token/78.48 GB/10,900,000,000 word/18,800,000 paragraph` corpus of Czech obtained by web crawling.
 - The original size of our corpus before the deduplication and LM-filtering steps was `266.44 GB`.
 - Our tokenizer size is 64k, and we use a GPT-2-like BPE encoding for tokenization.
+- The model is trained GPT-2 style: the first token is an actual text token (not BOS), so the probability of the first token cannot be computed (see the scoring sketch below).
+- Due to a feature of our code, the model was never trained to generate [EOS] (see the generation sketch below).
 - The model was trained for 133,000 update steps (~139B training tokens) before the end of the experiment.
 - The model was adapted from the original GPT-2 XL by:
   - replacing the tokenizer,
@@ -75,7 +77,7 @@ We will release the precise results once we advance with the work on our Czech e
 
 
 ## Disclaimer
-This is
+This is an intermediate result of our work in progress. [PH:Licensing Information]. For further questions, contact `martin.fajcik@vut.cz`.
 
 ## Acknowledgement
 This work was supported by the NAKI III program of the Ministry of Culture of the Czech Republic, project semANT ---
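
Because there is no BOS token, the first token of any input has no conditioning context and receives no probability, so scoring code must skip it. A minimal sketch using the Hugging Face `transformers` API, assuming the model loads as a standard causal LM; `MODEL_ID` is a hypothetical placeholder, substitute this repository's actual id:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/czech-gpt2-xl"  # hypothetical placeholder, not the real repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

text = "Dobrý den, jak se máte?"
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]  # no BOS is prepended

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)

# Position t predicts token t+1, and there is no BOS to condition on,
# so only tokens 1..n-1 can be scored; the first token is simply dropped.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
targets = input_ids[:, 1:]
token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

print(f"vocab size: {len(tokenizer)}")  # ~64k BPE vocabulary
print(f"log p(text), first token excluded: {token_log_probs.sum().item():.2f}")
```

The `[:, :-1]` / `[:, 1:]` shift reflects that position t predicts token t+1; with a BOS token the whole sequence would be scorable, but here token 0 has no prediction to compare against.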
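Since [EOS] is never generated, open-ended sampling will not terminate on its own and must be bounded by a length limit. A hedged generation sketch under the same hypothetical `MODEL_ID` assumption:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/czech-gpt2-xl"  # hypothetical placeholder, not the real repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

prompt = "Praha je hlavní město"
inputs = tokenizer(prompt, return_tensors="pt")

# [EOS] was never a training target, so do not rely on eos_token_id to stop
# the loop; cap the output explicitly with max_new_tokens.
output_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_p=0.95,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```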