conan1024hao committed · Commit 3d66092 · 1 Parent(s): c799aa4 · update readme
README.md CHANGED

@@ -32,7 +32,7 @@ You can fine-tune this model on downstream tasks.

## Tokenization

- `BertJapaneseTokenizer` now supports automatic tokenization for [Juman++](https://github.com/ku-nlp/jumanpp). However, if your dataset is large, tokenization may take a long time since `BertJapaneseTokenizer` does not yet support fast tokenization. You can still do the Juman++ tokenization yourself and use the old model [nlp-waseda/roberta-large-japanese](https://huggingface.co/nlp-waseda/roberta-large-japanese).
+ `BertJapaneseTokenizer` now supports automatic tokenization for [Juman++](https://github.com/ku-nlp/jumanpp). However, if your dataset is large, tokenization may take a long time since `BertJapaneseTokenizer` does not yet support fast tokenization. You can still do the Juman++ tokenization yourself and use the old model [nlp-waseda/roberta-large-japanese-seq512](https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512).

Juman++ 2.0.0-rc3 was used for pretraining. Each word is tokenized into tokens by [sentencepiece](https://github.com/google/sentencepiece).
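For reference, a minimal sketch of the automatic Juman++ tokenization described in the changed line, assuming the `transformers` library and Juman++ (with its Python binding) are installed. The model ID used below is an assumption for this repository and may differ:

```python
# Minimal sketch, not part of the original diff. Assumes transformers and
# Juman++ (with its Python binding) are installed; the model ID is an assumed
# placeholder for this repository.
from transformers import AutoTokenizer

# BertJapaneseTokenizer first segments the text into words with Juman++,
# then splits each word into subword tokens using the sentencepiece vocabulary.
tokenizer = AutoTokenizer.from_pretrained("nlp-waseda/roberta-large-japanese-with-auto-jumanpp")

text = "早稲田大学で自然言語処理を学ぶ。"
print(tokenizer.tokenize(text))
```

If the dataset is large, one can instead run Juman++ offline and pass the pre-segmented, whitespace-separated text to the old model, as the README notes.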