lalital committed on
Commit
dcd7863
·
1 Parent(s): 144b06e

Add padding token <pad> at the end of input sentence

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -1,6 +1,6 @@
1
  ---
2
  widget:
3
- - text: "ผู้ใช้งานสนามบินนานาชาติ<mask>มีกว่าสามล้านคน"
4
  ---
5
 
6
  # WangchanBERTa base model: `wangchanberta-base-att-spm-uncased`
@@ -80,14 +80,14 @@ The getting started notebook of WangchanBERTa model can be found at this [Colab
80
 
81
  Texts are preprocessed with the following rules:
82
 
83
- - Replace HTML forms of characters with the actual characters, such as `&nbsp;` with a space and \\\\<br /> with a line break [[Howard and Ruder, 2018]](https://arxiv.org/abs/1801.06146).
84
  - Remove empty brackets ((), {}, and []) that sometimes come up as a result of text extraction such as from Wikipedia.
85
  - Replace line breaks with spaces.
86
  - Replace multiple consecutive spaces with a single space.
87
  - Collapse characters repeated more than 3 times, such as ดีมากกก to ดีมาก [[Howard and Ruder, 2018]](https://arxiv.org/abs/1801.06146).
88
  - Word-level tokenization using [[Phatthiyaphaibun et al., 2020]](https://zenodo.org/record/4319685#.YA4xEGQzaDU)’s `newmm` dictionary-based maximal matching tokenizer.
89
  - Replace repetitive words; this is done post-tokenization, unlike [[Howard and Ruder, 2018]](https://arxiv.org/abs/1801.06146), since Thai words are not delimited by spaces as English words are.
90
- - Replace spaces with <\\\\_>. The SentencePiece tokenizer combines spaces with other tokens. Since spaces serve as punctuation in Thai, marking sentence boundaries much as periods do in English, combining them with other tokens would discard an important feature for tasks such as word tokenization and sentence breaking. Therefore, we opt to explicitly mark spaces with <\\\\_>.
91
 
92
  <br>
93
 
@@ -109,7 +109,7 @@ After preprocessing and deduplication, we have a training set of 381,034,638 uni
109
 
110
  **Pretraining**
111
 
112
- The model was trained on 8 V100 GPUs for 500,000 steps with the batch size of 4,096 (32 sequences per device with 16 accumulation steps) and a sequence length of 416 tokens. The optimizer we used is Adam with the learning rate of $3e-4$, $\\\\beta_1 = 0.9$, $\\\\beta_2= 0.999$ and $\\\\epsilon = 1e-6$. The learning rate is warmed up for the first 24,000 steps and linearly decayed to zero. The model checkpoint with minimum validation loss will be selected as the best model checkpoint.
113
 
114
  As of Sun 24 Jan 2021, we release the model from the checkpoint at 360,000 steps because pretraining has not yet been completed.
115
 
 
1
  ---
2
  widget:
3
+ - text: "ผู้ใช้งานสนามบินนานาชาติ<mask>มีกว่าสามล้านคน<pad>"
4
  ---
5
 
6
  # WangchanBERTa base model: `wangchanberta-base-att-spm-uncased`
 
80
 
81
  Texts are preprocessed with the following rules:
82
 
83
+ - Replace HTML forms of characters with the actual characters, such as `&nbsp;` with a space and \\\\\\\\<br /> with a line break [[Howard and Ruder, 2018]](https://arxiv.org/abs/1801.06146).
84
  - Remove empty brackets ((), {}, and []) that sometimes come up as a result of text extraction such as from Wikipedia.
85
  - Replace line breaks with spaces.
86
  - Replace multiple consecutive spaces with a single space.
87
  - Collapse characters repeated more than 3 times, such as ดีมากกก to ดีมาก [[Howard and Ruder, 2018]](https://arxiv.org/abs/1801.06146).
88
  - Word-level tokenization using [[Phatthiyaphaibun et al., 2020]](https://zenodo.org/record/4319685#.YA4xEGQzaDU)’s `newmm` dictionary-based maximal matching tokenizer.
89
  - Replace repetitive words; this is done post-tokenization, unlike [[Howard and Ruder, 2018]](https://arxiv.org/abs/1801.06146), since Thai words are not delimited by spaces as English words are.
90
+ - Replace spaces with <\\\\\\\\_>. The SentencePiece tokenizer combines spaces with other tokens. Since spaces serve as punctuation in Thai, marking sentence boundaries much as periods do in English, combining them with other tokens would discard an important feature for tasks such as word tokenization and sentence breaking. Therefore, we opt to explicitly mark spaces with <\\\\\\\\_>.
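
The character-level rules in the list above can be sketched in a few lines of standard-library Python. This is only an illustration, not the authors' preprocessing code: the names `preprocess`, `SPACE_TOKEN`, and `min_run` are hypothetical, the space marker is written as `<_>` on the assumption that the backslashes in the README are Markdown escapes, and the `newmm` tokenization and repeated-word steps are left out because they depend on an external tokenizer. The README says "more than 3" repetitions while its example ดีมากกก contains three, so the threshold is exposed as a parameter and defaults to the value that reproduces the example.

```python
import html
import re

# Assumed unescaped form of the space marker shown in the README.
SPACE_TOKEN = "<_>"


def preprocess(text: str, min_run: int = 3) -> str:
    """Apply the character-level cleaning rules in the README's order."""
    # HTML entities -> characters; &nbsp; becomes U+00A0, mapped to a space.
    text = html.unescape(text).replace("\u00a0", " ")
    # <br>, <br/> and <br /> -> line break.
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    # Remove empty brackets left behind by text extraction.
    text = re.sub(r"\(\s*\)|\{\s*\}|\[\s*\]", "", text)
    # Line breaks -> spaces.
    text = re.sub(r"[\r\n]+", " ", text)
    # Multiple consecutive spaces -> one space.
    text = re.sub(r" {2,}", " ", text)
    # Collapse a character repeated min_run or more times to one occurrence.
    text = re.sub(r"(\S)\1{%d,}" % (min_run - 1), r"\1", text)
    # Mark spaces explicitly so the subword tokenizer cannot absorb them.
    return text.replace(" ", SPACE_TOKEN)
```

For example, `preprocess("ดีมากกก")` gives `ดีมาก`, and `preprocess("a&nbsp;b<br />c  d")` gives `a<_>b<_>c<_>d`.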
91
 
92
  <br>
93
 
 
109
 
110
  **Pretraining**
111
 
112
+ The model was trained on 8 V100 GPUs for 500,000 steps with the batch size of 4,096 (32 sequences per device with 16 accumulation steps) and a sequence length of 416 tokens. The optimizer we used is Adam with the learning rate of $3e-4$, $\\\\\\\\beta_1 = 0.9$, $\\\\\\\\beta_2= 0.999$ and $\\\\\\\\epsilon = 1e-6$. The learning rate is warmed up for the first 24,000 steps and linearly decayed to zero. The model checkpoint with minimum validation loss will be selected as the best model checkpoint.
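
The figures in the pretraining paragraph can be checked with a short sketch. The function names below are hypothetical, and two details are assumptions rather than statements from the README: that the warmup is linear from zero, and that the linear decay starts right after warmup and reaches zero at step 500,000.

```python
def effective_batch_size(num_devices: int, per_device: int, accumulation_steps: int) -> int:
    """Sequences contributing to one optimizer update."""
    return num_devices * per_device * accumulation_steps


def learning_rate(
    step: int,
    peak_lr: float = 3e-4,
    warmup_steps: int = 24_000,
    total_steps: int = 500_000,
) -> float:
    """Linear warmup (assumed) to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Clamp so steps past total_steps stay at zero instead of going negative.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


print(effective_batch_size(8, 32, 16))  # 8 GPUs x 32 sequences x 16 accumulation steps
print(learning_rate(360_000))           # rate at the released checkpoint under these assumptions
```

With 8 devices, 32 sequences per device, and 16 accumulation steps, the product is 4,096, which matches the stated batch size.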
113
 
114
  As of Sun 24 Jan 2021, we release the model from the checkpoint at 360,000 steps because pretraining has not yet been completed.
115