Update README.md
conan1024hao committed
Commit: e3ec7a1
Parent(s): 40948ff
README.md CHANGED
@@ -34,6 +34,8 @@ You can fine-tune this model on downstream tasks.
 
 The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. Juman++ 2.0.0-rc3 was used for pretraining. Each word is tokenized into tokens by [sentencepiece](https://github.com/google/sentencepiece).
 
+`BertJapaneseTokenizer` now supports automatic `JumanppTokenizer` and `SentencepieceTokenizer`. You can use [this model](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp) without any data preprocessing.
+
 ## Vocabulary
 
 The vocabulary consists of 32000 tokens including words ([JumanDIC](https://github.com/ku-nlp/JumanDIC)) and subwords induced by the unigram language model of [sentencepiece](https://github.com/google/sentencepiece).
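
A minimal sketch of the preprocessing described above: segment raw text with Juman++ first, then pass the whitespace-joined words to the tokenizer. This assumes the pyknp binding for Juman++; `MODEL_ID` is a placeholder for this repository's model id, which the diff does not name.

```python
from pyknp import Juman
from transformers import AutoTokenizer

MODEL_ID = "nlp-waseda/..."  # placeholder: this repository's model id

# Segment the raw sentence into words with Juman++ (pyknp wraps the jumanpp binary,
# assumed to be installed and on PATH; Juman++ 2.0.0-rc3 was used for pretraining).
juman = Juman(command="jumanpp")
words = [m.midasi for m in juman.analysis("早稲田大学で自然言語処理を研究する。").mrph_list()]

# The tokenizer expects pre-segmented, whitespace-joined words;
# sentencepiece then splits each word into subword tokens.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
inputs = tokenizer(" ".join(words), return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
```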
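
And a sketch of the no-preprocessing path that the added lines describe, using the linked with-auto-jumanpp checkpoint. It assumes a transformers release whose `BertJapaneseTokenizer` ships the Juman++ word tokenizer, plus a local Juman++ installation.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# The linked variant runs Juman++ segmentation inside the tokenizer,
# so raw (unsegmented) Japanese text can be passed directly.
name = "nlp-waseda/roberta-large-japanese-seq512-with-auto-jumanpp"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

inputs = tokenizer("早稲田大学で自然言語処理を研究する。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)
```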