Update README.md
README.md CHANGED

@@ -2,11 +2,14 @@
 license: mit
 language:
 - en
+tags:
+- babylm
+- tokenizer
 ---
 
 ## Baby Tokenizer
 
-Compact sentencepiece tokenizer for sample-efficient English language modeling.
+Compact sentencepiece tokenizer for sample-efficient English language modeling, simply tokenizing natural language.
 
 ### Data
 
@@ -21,4 +24,4 @@ This tokeniser is derived from the BabyLM 100M dataset of mixed domain data, con
 
 - Vocabulary size: 20k
 - Alphabet limit: 150
-- Minimum token frequency:
+- Minimum token frequency: 100
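The three parameters in the updated list (`vocab_size`, `limit_alphabet`, `min_frequency`) match the trainer options of the Hugging Face `tokenizers` library. A minimal sketch of how such a tokenizer could be trained, assuming a BPE model and a tiny stand-in corpus — the actual training script is not part of this commit, so treat every name here as illustrative:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical reconstruction: the commit only records the parameters,
# not the training code. The trainer options below mirror the README:
# vocabulary size 20k, alphabet limit 150, minimum token frequency 100.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=20_000,    # Vocabulary size: 20k
    limit_alphabet=150,   # Alphabet limit: 150
    min_frequency=100,    # Minimum token frequency: 100
    special_tokens=["[UNK]"],
)

# Tiny stand-in corpus; the real tokenizer was derived from the
# BabyLM 100M dataset, which is not bundled here.
corpus = ["the cat sat on the mat ."] * 200
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("the cat sat").tokens)
```

With a real corpus, `train_from_iterator` would be replaced by `tokenizer.train(files=[...], trainer=trainer)` over the BabyLM text files, and the result saved with `tokenizer.save("tokenizer.json")`.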