everdoubling/byt5-Korean-large



everdoubling committed on Mar 4, 2022

Commit 258bc45 • 1 Parent(s): cdd237e

Update README.md

Files changed (1)

README.md +5 -2


@@ -1,15 +1,18 @@
 ---
 license: apache-2.0
 ---
 # ByT5-Korean - large
-ByT5-Korean is a Korean version of Google's ByT5.
 A Korean syllable has three components (called Jamo): a beginning consonant, a middle vowel, and an optional final consonant; they are like the individual letters of an alphabet.
 While ByT5's UTF-8 encoding provides a generic encoding for multiple languages, it is unnatural for Korean because it splits the bit representation of each Jamo in the middle.
 ByT5-Korean extends ByT5's UTF-8 encoding with special care for Korean syllables; each Jamo is represented with an extra token.
 ## Encoding Scheme
 ```text
@@ -27,7 +30,7 @@ id: token
 ```python
 import torch
-from tokenizer import ByT5KoreanTokenizer
 from transformers import T5ForConditionalGeneration
 tokenizer_jamo = ByT5KoreanTokenizer()

 ---
+datasets:
+- mc4
 license: apache-2.0
 ---
 # ByT5-Korean - large
+ByT5-Korean is a Korean-specific extension of Google's [ByT5](https://github.com/google-research/byt5).
 A Korean syllable has three components (called Jamo): a beginning consonant, a middle vowel, and an optional final consonant; they are like the individual letters of an alphabet.
 While ByT5's UTF-8 encoding provides a generic encoding for multiple languages, it is unnatural for Korean because it splits the bit representation of each Jamo in the middle.
 ByT5-Korean extends ByT5's UTF-8 encoding with special care for Korean syllables; each Jamo is represented with an extra token.
+ByT5-Korean was pre-trained on [mC4](https://www.tensorflow.org/datasets/catalog/c4#c4multilingual) with 70% Korean and 30% English.
 ## Encoding Scheme
 ```text
 ```python
 import torch
+from tokenizer import ByT5KoreanTokenizer # https://github.com/everdoubling/byt5-Korean
 from transformers import T5ForConditionalGeneration
 tokenizer_jamo = ByT5KoreanTokenizer()
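
The README text in this diff says that a Korean syllable is made of up to three Jamo and that plain UTF-8 bytes cut across them. The sketch below illustrates that point using only the standard Unicode arithmetic for precomposed Hangul syllables. It is not the `ByT5KoreanTokenizer` from the repository, and the helper names `decompose` and `utf8_bytes` are hypothetical; the token ids that ByT5-Korean actually assigns to each Jamo are not shown in this diff and are not reproduced here.

```python
from typing import List, Tuple

# Standard Unicode layout of precomposed Hangul syllables (U+AC00..U+D7A3).
# These constants come from the Unicode Standard, not from ByT5-Korean.
HANGUL_BASE = 0xAC00   # first precomposed syllable
NUM_INITIAL = 19       # beginning consonants
NUM_MEDIAL = 21        # middle vowels
NUM_FINAL = 28         # final consonants; index 0 means "no final consonant"
NUM_SYLLABLES = NUM_INITIAL * NUM_MEDIAL * NUM_FINAL


def decompose(syllable: str) -> Tuple[int, int, int]:
    """Return (initial, medial, final) Jamo indices for one precomposed syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_SYLLABLES:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    # The code point packs the three indices as (initial * 21 + medial) * 28 + final.
    initial, rest = divmod(offset, NUM_MEDIAL * NUM_FINAL)
    medial, final = divmod(rest, NUM_FINAL)
    return initial, medial, final


def utf8_bytes(syllable: str) -> List[int]:
    """Return the UTF-8 byte values that a byte-level model would see."""
    return list(syllable.encode("utf-8"))


if __name__ == "__main__":
    syllable = chr(0xD55C)  # the syllable "han"
    print(decompose(syllable))   # three Jamo indices
    print(utf8_bytes(syllable))  # three bytes that do not align with the Jamo
```

Both views of a syllable have three elements, but the UTF-8 bytes are fixed-width slices of the code point, so the bits that identify one Jamo can straddle two bytes. A Jamo-level scheme such as the one the README describes gives each component its own token instead.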