everdoubling committed on
Commit
258bc45
1 Parent(s): cdd237e

Update README.md

Files changed (1)
  1. README.md +5 -2
README.md CHANGED
@@ -1,15 +1,18 @@
  ---
+ datasets:
+ - mc4
  license: apache-2.0
  ---
 
  # ByT5-Korean - large
 
- ByT5-Korean is a Korean version of Google's ByT5.
+ ByT5-Korean is a Korean specific extension of Google's [ByT5](https://github.com/google-research/byt5).
 
  A Korean syllable has three components (called Jamo): a beginning consonant, a middle vowel, and an optional final consonant; they are like individual characters of alphabet.
  While the ByT5's utf-8 encoding allows generic encoding for multiple languages, it is unnatural for Korean because it splits the bits representation of each Jamo in the middle.
 
  ByT5-Korean extends ByT5's utf-8 encoding with special care for Korean syllables; each Jamo is represented with a extra token.
+ ByT5-Korean was pre-trained on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual) with 70% Korean and 30% English.
 
  ## Encoding Scheme
  ```text
@@ -27,7 +30,7 @@ id: token
 
  ```python
  import torch
- from tokenizer import ByT5KoreanTokenizer
+ from tokenizer import ByT5KoreanTokenizer # https://github.com/everdoubling/byt5-Korean
  from transformers import T5ForConditionalGeneration
 
  tokenizer_jamo = ByT5KoreanTokenizer()
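
The README text in this diff says that UTF-8 splits the bit representation of each Jamo. The sketch below illustrates that point using only standard Unicode arithmetic for precomposed Hangul syllables (U+AC00 to U+D7A3). It is not the `ByT5KoreanTokenizer` implementation and does not reproduce the token ids from the README's Encoding Scheme section; the function names `decompose_syllable` and `utf8_bits` are hypothetical and chosen here for illustration.

```python
HANGUL_BASE = 0xAC00   # code point of the first precomposed syllable
NUM_INITIALS = 19      # beginning consonants
NUM_VOWELS = 21        # middle vowels
NUM_FINALS = 28        # 27 final consonants plus "no final consonant"


def decompose_syllable(ch):
    """Return (initial, vowel, final) Jamo indices for one Hangul syllable.

    The final index is 0 when the syllable has no final consonant.
    """
    offset = ord(ch) - HANGUL_BASE
    if not 0 <= offset < NUM_INITIALS * NUM_VOWELS * NUM_FINALS:
        raise ValueError(f"{ch!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(offset, NUM_VOWELS * NUM_FINALS)
    vowel, final = divmod(rest, NUM_FINALS)
    return initial, vowel, final


def utf8_bits(ch):
    """Return the UTF-8 bytes of a character as space-separated bit strings."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


if __name__ == "__main__":
    for syllable in "한글":
        print(syllable, decompose_syllable(syllable), utf8_bits(syllable))
```

The code point is a mixed-radix number over the three Jamo indices (19 x 21 x 28), while UTF-8 cuts its 16 payload bits into fixed groups of 4, 6, and 6. The byte boundaries therefore do not line up with the Jamo boundaries, which is the mismatch the README describes and the motivation for giving each Jamo its own token.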