shirayu committed
Commit 7a7211a
Parent: 3ce8a0e

Added links

Files changed (1)
  1. README.md +4 -3
README.md CHANGED
@@ -10,13 +10,16 @@ datasets:
 - wiki40b
 ---
 
-# t5-base-japanese-web (with Byte-fallback)
+# t5-base-japanese-web (with Byte-fallback, 32K)
 
 ## Description
 
 [megagonlabs/t5-base-japanese-web](https://huggingface.co/megagonlabs/t5-base-japanese-web) is a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts.
 Training codes are [available on GitHub](https://github.com/megagonlabs/t5-japanese).
 
+The vocabulary size of this model is 32K.
+[8K version is also available](https://huggingface.co/megagonlabs/t5-base-japanese-web-8k).
+
 ### Corpora
 
 We used following corpora for pre-training.
@@ -28,7 +31,6 @@ We used following corpora for pre-training.
 - 828,236 articles (2,073,584 examples)
 - 2 GB in TFRecord format
 
-
 ### Tokenizer
 
 We used Japanese Wikipedia to train [SentencePiece](https://github.com/google/sentencepiece).
@@ -52,7 +54,6 @@ It took about 126 hours with TPU v3-8
 
 Apache License 2.0
 
-
 ## Citations
 
 - mC4
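
The "32K" added to the title is the SentencePiece vocabulary size, and byte-fallback means characters outside that vocabulary decompose into UTF-8 byte pieces instead of mapping to `<unk>`. A minimal usage sketch, assuming the checkpoint loads with the standard `transformers` T5 classes and that `transformers` and `sentencepiece` are installed:

```python
# Minimal sketch (assumptions noted above): load the 32K-vocabulary
# checkpoint this commit documents; swap in
# "megagonlabs/t5-base-japanese-web-8k" for the linked 8K variant.
from transformers import T5Tokenizer, T5ForConditionalGeneration

name = "megagonlabs/t5-base-japanese-web"
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

# Byte-fallback in action: characters absent from the vocabulary are
# split into UTF-8 byte pieces (e.g. <0xE3>) instead of becoming <unk>.
ids = tokenizer("むかしむかし、あるところに").input_ids
print(tokenizer.convert_ids_to_tokens(ids))
```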
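The Tokenizer section says SentencePiece was trained on Japanese Wikipedia, but the training command itself is not part of this diff. A hedged sketch of such a run follows; the input path and every flag value below are illustrative assumptions, not settings recovered from the commit:

```python
# Hedged sketch of training a byte-fallback SentencePiece model; the input
# path and all flag values are illustrative, not megagonlabs' actual settings.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="jawiki_text.txt",    # hypothetical: Japanese Wikipedia, one sentence per line
    model_prefix="spm_ja_32k",
    vocab_size=32000,           # the "32K" in this model's name
    byte_fallback=True,         # emit UTF-8 byte pieces for unseen characters
    character_coverage=0.9995,  # a common choice for Japanese text
)
# Writes spm_ja_32k.model and spm_ja_32k.vocab
```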