sho-takase committed a87f531 ("Fix readme")
Parent: 05e3c0a
README.md
CHANGED
@@ -52,7 +52,7 @@ for t in text:
 Our training corpus consists of the Japanese portions of publicly available corpus such as C4, CC-100, and Oscar.
 We also incorporated the Web texts crawled by in-house system.
 The total size of our training corpus is about 650 GB.
-The trained model achieves 7.50 perplexity on the internal validation sets of Japanese C4
+The trained model achieves 7.50 perplexity on the internal validation sets of Japanese C4.
 
 ## Tokenization
 We use a sentencepiece tokenizer with a unigram language model and byte-fallback.
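The tokenizer line in this hunk mentions a sentencepiece unigram model with byte-fallback. As a hedged illustration only (the corpus path, model prefix, and vocabulary size below are assumptions, not taken from this repository), training and using such a tokenizer with the sentencepiece Python package looks roughly like this:

```python
import sentencepiece as spm

# Illustrative training call: a unigram model with byte-fallback, as the
# README describes. Input file, prefix, and vocab size are assumed values.
spm.SentencePieceTrainer.train(
    input="corpus_ja.txt",      # hypothetical plain-text corpus dump
    model_prefix="ja_unigram",  # hypothetical output prefix
    vocab_size=32000,           # assumed size, not stated in the README
    model_type="unigram",
    byte_fallback=True,         # unseen characters decompose into byte pieces
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file="ja_unigram.model")
print(sp.encode("これはテストです。", out_type=str))
```

With byte_fallback enabled, characters missing from the learned vocabulary are split into byte pieces instead of collapsing to an unknown token, which keeps rare Japanese characters representable.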