Update README.md
README.md
CHANGED
@@ -19,6 +19,7 @@ You can try out the model at [European Language Grid](https://live.european-lang
 - ca. 50B German tokens
 - Web-crawled content from the German subset [OSCAR v22.01](https://oscar-corpus.com/post/oscar-v22-01/) (excluding content tagged as header, footer, noisy, or adult)
 - Web-crawled content from the [GC4 Corpus](https://german-nlp-group.github.io/projects/gc4-corpus.html) (including only the head and middle parts)
+- Both Web-crawled datasets are deduplicated with [Google's suffix array implementation](https://github.com/google-research/deduplicate-text-datasets)
 - German court decisions from [Open Legal Data](http://openlegaldata.io/)
 
 ## Code
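
Note on the added line: the idea behind suffix-array deduplication can be illustrated with a toy sketch. The code below is not the linked tool's code or API. It is a minimal illustration that assumes a naive suffix array construction, and the function names and the `min_len` threshold are hypothetical. It finds byte spans of at least `min_len` bytes that repeat an earlier occurrence and drops the later copies.

```python
def build_suffix_array(data: bytes) -> list[int]:
    # Naive construction by sorting all suffixes; only suitable for tiny inputs.
    return sorted(range(len(data)), key=lambda i: data[i:])


def common_prefix_len(data: bytes, i: int, j: int) -> int:
    # Length of the longest common prefix of the suffixes starting at i and j.
    k = 0
    while i + k < len(data) and j + k < len(data) and data[i + k] == data[j + k]:
        k += 1
    return k


def duplicate_spans(data: bytes, min_len: int) -> list[tuple[int, int]]:
    """Return merged (start, end) ranges that repeat an earlier occurrence."""
    if not data:
        return []
    sa = build_suffix_array(data)
    marked: list[tuple[int, int]] = []

    def flush(run: list[int]) -> None:
        # All suffixes in a run share a prefix of >= min_len bytes.
        # Keep the earliest occurrence and mark every other one.
        first = min(run)
        marked.extend((p, p + min_len) for p in run if p != first)

    run = [sa[0]]
    for prev, cur in zip(sa, sa[1:]):
        if common_prefix_len(data, prev, cur) >= min_len:
            run.append(cur)
        else:
            flush(run)
            run = [cur]
    flush(run)

    # Merge overlapping or touching ranges into maximal spans.
    merged: list[tuple[int, int]] = []
    for start, end in sorted(marked):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def remove_spans(data: bytes, spans: list[tuple[int, int]]) -> bytes:
    # Copy everything that lies outside the marked spans.
    out = bytearray()
    pos = 0
    for start, end in spans:
        out += data[pos:start]
        pos = end
    out += data[pos:]
    return bytes(out)


if __name__ == "__main__":
    text = (b"The cat sat on the mat. First unique line. "
            b"The cat sat on the mat. Second one differs.")
    spans = duplicate_spans(text, min_len=10)
    print(remove_spans(text, spans).decode())
```

Sorted suffixes place repeated substrings next to each other, so duplicates show up as neighbouring entries with a long common prefix. A production tool would build the array in near-linear time and run over the whole corpus instead of a single string.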