---
license: apache-2.0
---

# ByT5-Korean - large

ByT5-Korean is a Korean version of Google's ByT5.

A Korean syllable has three components, called Jamo: a beginning consonant, a middle vowel, and an optional final consonant; they function like the individual letters of an alphabet.
While ByT5's UTF-8 encoding handles any language generically, it is unnatural for Korean because the byte boundaries split the bit representation of each Jamo in the middle.
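
To make the mismatch concrete, the sketch below uses only standard Unicode Hangul arithmetic, no ByT5-Korean code: a composed syllable such as 한 is a single code point that packs its three Jamo into one number, and its UTF-8 bytes slice that number at boundaries that do not line up with the Jamo fields.

```python
# Standard Unicode Hangul arithmetic: a composed syllable is encoded as
# 0xAC00 + (lead * 588) + (vowel * 28) + final, one code point per syllable.
syllable = '한'
idx = ord(syllable) - 0xAC00
lead, vowel, final = idx // 588, (idx % 588) // 28, idx % 28
print(lead, vowel, final)               # 18 0 4 -> ㅎ, ㅏ, ㄴ

# ByT5 sees the raw UTF-8 bytes instead; this three-byte split of the code
# point crosses the Jamo fields rather than following them.
print(list(syllable.encode('utf-8')))   # [237, 149, 156]
```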

ByT5-Korean extends ByT5's UTF-8 encoding with special care for Korean syllables: each Jamo is represented with a dedicated extra token.

## Encoding Scheme
```text
id: token
0: <pad>
1: <unk>
2: <eos>
3~258: utf-8 encoding
259~277: beginning consonants(초성), from ㄱ to ㅎ
279~299: middle vowel(중성), from ㅏ to ㅣ
300~327: final consonant(종성), None, from ㄱ to ㅎ
328~384: from <extra_id_0> to <extra_id_56>
```
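
As an illustration of how this table maps a syllable to IDs, here is a hypothetical helper (not part of the released code), assuming each range follows the standard Unicode Jamo ordering; the released ByT5KoreanTokenizer is the authoritative implementation.

```python
# Hypothetical illustration only: map one Hangul syllable to the three
# Jamo token ids of the table above, assuming Unicode Jamo ordering.
def jamo_token_ids(syllable: str) -> list[int]:
    idx = ord(syllable) - 0xAC00
    lead, vowel, final = idx // 588, (idx % 588) // 28, idx % 28
    return [259 + lead, 279 + vowel, 300 + final]  # 300 = no final consonant

print(jamo_token_ids('한'))  # [277, 279, 304] -> ㅎ + ㅏ + ㄴ
```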
## Example Inference

```python
import torch
from tokenizer import ByT5KoreanTokenizer
from transformers import T5ForConditionalGeneration

tokenizer_jamo = ByT5KoreanTokenizer()
model_jamo = T5ForConditionalGeneration.from_pretrained('everdoubling/byt5-Korean-large')

input_sentence = '한국어 위키백과(영어: Korean Wikipedia)는 한국어로 운영되는 위키백과의 다언어판 가운데 하나로서, 2002년 10월 11일에 <extra_id_0>. 또한 현재 한국어 위키백과에는 넘겨주기, 토론, 그림 등 페이지로 불리는 모든 문서를 포함하면 총 2,629,860개가 <extra_id_1>되어 있으며, 넘겨주기를 포함한 일반 문서 수는 1,278,560개,[1] 그중 넘겨주기, 막다른 문서를 제외한 일반 문서 수는 573,149개이다.'

input_ids_jamo = tokenizer_jamo(input_sentence).input_ids
outputs_jamo = model_jamo.generate(torch.tensor([input_ids_jamo]))
print(tokenizer_jamo.decode(outputs_jamo[0]))
# <pad><extra_id_0>설립되었다<extra_id_1>ĊĊ
```
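
The input masks two spans of the Korean Wikipedia introduction; the model fills <extra_id_0> with 설립되었다 ("was established"), while its continuation for <extra_id_1> degenerates into filler characters. Note that ByT5KoreanTokenizer is not part of the transformers library; the import above assumes the tokenizer.py presumably provided in this model repository sits next to the script.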

Additional information coming soon...