Update README.md
README.md CHANGED
@@ -57,11 +57,11 @@ You can use the pretrained model for masked language modeling (i.e. predicting a

 - `thainer`

-   Named-entity recognition tagging with 13 named-entities as
+   Named-entity recognition tagging with 13 named-entities as described in this [page](https://huggingface.co/datasets/thainer).

 - `lst20` : NER and POS tagging

-   Named-entity recognition tagging with 10 named-entities and Part-of-Speech tagging with 16 tags as
+   Named-entity recognition tagging with 10 named-entities and Part-of-Speech tagging with 16 tags as described in this [page](https://huggingface.co/datasets/lst20).

 <br>
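As a pointer for the finetuning datasets referenced above, the sketch below loads the `thainer` dataset linked in the added line via the Hugging Face `datasets` library. It assumes the dataset can be loaded directly by the Hub id in that link; the split and column names (`train`, `tokens`, `ner_tags`) are assumptions based on common token-classification layouts, not taken from this commit.

```python
# Minimal sketch: inspect the NER annotations of the `thainer` dataset linked above.
# Assumption: the dataset loads by its Hub id and exposes a "train" split with
# "tokens" / "ner_tags" columns, as is typical for token-classification datasets.
from datasets import load_dataset

dataset = load_dataset("thainer")   # https://huggingface.co/datasets/thainer
sample = dataset["train"][0]

print(sample["tokens"][:10])        # first few Thai tokens of the sentence
print(sample["ner_tags"][:10])      # corresponding named-entity tag ids
```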
@@ -105,7 +105,7 @@ Regarding the masking procedure, for each sequence, we sampled 15% of the tokens

 **Train/Val/Test splits**

-After preprocessing and deduplication, we have a training set of 381,034,638 unique,mostly Thai sentences with sequence length of 5 to 300 words (78.5GB). The training set has a total of 16,957,775,412 words as tokenized by dictionary-based maximal matching [[Phatthiyaphaibun et al., 2020]](https://zenodo.org/record/4319685#.YA4xEGQzaDU), 8,680,485,067 subwords
+After preprocessing and deduplication, we have a training set of 381,034,638 unique, mostly Thai sentences with sequence length of 5 to 300 words (78.5GB). The training set has a total of 16,957,775,412 words as tokenized by dictionary-based maximal matching [[Phatthiyaphaibun et al., 2020]](https://zenodo.org/record/4319685#.YA4xEGQzaDU), 8,680,485,067 subwords as tokenized by SentencePiece tokenizer, and 53,035,823,287 characters.
 <br>

 **Pretraining**
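For the word/subword/character counts in the updated paragraph, the sketch below shows how such counts could be reproduced for a single sentence. PyThaiNLP's `newmm` engine stands in for the cited dictionary-based maximal matching tokenizer, and the SentencePiece model path is a placeholder rather than the actual model shipped with this repository.

```python
# Minimal sketch: count words, subwords, and characters for one sentence.
# Assumptions: PyThaiNLP's "newmm" engine approximates the cited dictionary-based
# maximal matching tokenizer; "spm.model" is a placeholder SentencePiece model path.
from pythainlp.tokenize import word_tokenize
import sentencepiece as spm

text = "ภาษาไทยเป็นภาษาราชการของประเทศไทย"  # "Thai is the official language of Thailand"

words = word_tokenize(text, engine="newmm")               # dictionary-based maximal matching
print(len(words), "words")

sp = spm.SentencePieceProcessor(model_file="spm.model")   # placeholder model path
subwords = sp.encode(text, out_type=str)
print(len(subwords), "subwords")

print(len(text), "characters")
```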