Update README.md
README.md
CHANGED
@@ -88,10 +88,10 @@ Roller et al. (2021)
 - CCNewsV2 containing an updated version of the English portion of the CommonCrawl News
 dataset that was used in RoBERTa (Liu et al., 2019b)
 
-
+The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally
 to each dataset’s size in the pretraining corpus.
 
-
+The dataset might contains offensive content as parts of the dataset are a subset of
 public Common Crawl data, along with a subset of public Reddit data, which could contain sentences
 that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety.
 