nilq committed on
Commit
3b07e52
1 Parent(s): 119d28a

Update README.md

Files changed (1)
  1. README.md +2 -1
README.md CHANGED
@@ -12,7 +12,8 @@ Compact sentencepiece tokenizer for sample-efficient English language modeling.
 
 This tokeniser is derived from the BabyLM 100M dataset of mixed domain data, consisting of the following sources:
 - CHILDES (child-directed speech)
-- Subtitles (speech), BNC (speech)
+- Subtitles (speech)
+- BNC (speech)
 - TED talks (speech)
 - children's books (simple written language).
 