Commit 8c6fd05 by teven
1 Parent(s): fb79742

Create README.md

Files changed (1)
  1. README.md +19 -0
README.md ADDED
@@ -0,0 +1,19 @@
+ ---
+ language:
+ - en
+ ---
+
+ V1 of an English/code tokenizer. Equal mix of the following sources.
+ On the natural-language (NL) side:
+ - Books
+ - C4
+ - v1 of our CC (helen quality classifier)
+ - enwiki
+ - Gutenberg
+ - Reddit
+ On the code side:
+ - Jupyter notebooks (0.5 weight, because it was small)
+ - GitHub issues
+ - Stack Exchange
+ - The cleaned Python Stack
+ For a total of about 1/3 code data (although Stack Exchange and GitHub issues also contain a lot of English).
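
As a sanity check on the 1/3 figure, here is a minimal sketch that recomputes the code share from the sources listed in the README. It assumes that "equal mix" means every source gets a weight of 1.0, except Jupyter notebooks at the stated 0.5. That reading, along with the dictionary and function names, is an assumption made for illustration and is not stated in the commit.

```python
# Hypothetical per-source sampling weights. Assumption: "equal mix" means
# weight 1.0 per source, except Jupyter notebooks at the stated 0.5.
NL_SOURCES = {
    "Books": 1.0,
    "C4": 1.0,
    "CC v1": 1.0,
    "enwiki": 1.0,
    "Gutenberg": 1.0,
    "Reddit": 1.0,
}
CODE_SOURCES = {
    "Jupyter notebooks": 0.5,
    "GitHub issues": 1.0,
    "Stack Exchange": 1.0,
    "Cleaned Python Stack": 1.0,
}


def code_fraction(nl_weights, code_weights):
    """Return the share of the total mix weight that goes to code sources."""
    code_total = sum(code_weights.values())
    total = sum(nl_weights.values()) + code_total
    return code_total / total


if __name__ == "__main__":
    # 3.5 / 9.5 under the assumed weights
    print(f"code share: {code_fraction(NL_SOURCES, CODE_SOURCES):.3f}")
```

Under this reading the code share is 3.5 / 9.5, a little above the stated 1/3. The effective share of actual code tokens is likely lower, since the README notes that Stackexchange and GH issues contain a lot of English.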