malteos/bloom-6b4-clp-german

malteos committed on Jan 25, 2023

Commit f43e254 • 1 Parent(s): c0c608e

Update README.md

Files changed (1)

README.md +1 -0

@@ -19,6 +19,7 @@ You can try out the model at [European Language Grid](https://live.european-lang
 - ca. 50B German tokens
 - Web-crawled content from the German subset [OSCAR v22.01](https://oscar-corpus.com/post/oscar-v22-01/) (excluding content tagged as header, footer, noisy, or adult)
 - Web-crawled content from the [GC4 Corpus](https://german-nlp-group.github.io/projects/gc4-corpus.html) (including only the head and middle parts)
+- Both Web-crawled datasets are deduplicated with [Google's suffix array implementation](https://github.com/google-research/deduplicate-text-datasets)
 - German court decisions from [Open Legal Data](http://openlegaldata.io/)
 ## Code
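
Note on the added line: the linked tool finds repeated substrings across a corpus by means of a suffix array. The snippet below is only a toy sketch of that idea, not the linked implementation and not the pipeline used for this model. The function names (`build_suffix_array`, `find_duplicate_spans`) and the `min_len` threshold are hypothetical, and the naive construction is suitable for short strings only.

```python
from typing import List, Tuple


def build_suffix_array(text: str) -> List[int]:
    # Naive construction: sort suffix start positions by the suffix text.
    # Illustrative only; real tools use far more scalable algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a: str, b: str) -> int:
    # Length of the longest common prefix of two strings.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def find_duplicate_spans(text: str, min_len: int) -> List[Tuple[int, int]]:
    # Neighbouring suffixes in the sorted order share the longest prefixes,
    # so comparing adjacent entries is enough to surface repeated substrings.
    # For each adjacent pair sharing at least min_len characters, report the
    # later occurrence as (start, length).
    sa = build_suffix_array(text)
    spans = []
    for prev, cur in zip(sa, sa[1:]):
        lcp = common_prefix_len(text[prev:], text[cur:])
        if lcp >= min_len:
            spans.append((max(prev, cur), lcp))
    return sorted(spans)


if __name__ == "__main__":
    # Hypothetical two-document corpus joined by a separator character.
    print(find_duplicate_spans("abcXabc", min_len=3))
```

A deduplication step would then drop or trim the reported spans. The choice to keep the earlier occurrence and flag the later one is an assumption made here for illustration.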