# RoBERTa-base Korean
|
|
## Model Description

This RoBERTa model was pretrained at the **syllable** level on a variety of Korean text datasets.

It uses a custom-built Korean syllable-level vocab.
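
To make the **syllable** unit concrete: every Hangul syllable (character) becomes its own token, with no subword merging. The snippet below is an illustration only, in plain Python; the repo's actual SyllableTokenizer additionally handles special tokens, unknown characters, and spacing.

```python
# Illustration only: syllable-level tokenization treats each Hangul
# syllable (character) as a single token, unlike WordPiece subwords.
text = "한국어 텍스트"
syllables = [ch for ch in text if ch != " "]
print(syllables)  # ['한', '국', '어', '텍', '스', '트']
```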
|
|
## Architecture
|
- **Model type**: RoBERTa

- **Architecture**: RobertaForMaskedLM

- **Model size**: hidden size 128, 8 hidden layers, 8 attention heads

- **max_position_embeddings**: 514

- **intermediate_size**: 2048

- **vocab_size**: 1428
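
For reference, the values above map directly onto a `transformers` `RobertaConfig`. The following is a minimal sketch, not the repo's actual training code:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Config mirroring the architecture values listed above
config = RobertaConfig(
    vocab_size=1428,
    hidden_size=128,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```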
|
|
## Training Data

The following datasets were used:

- **모두의말뭉치 (Modu Corpus)**: chat, message boards, everyday conversation, news, broadcast scripts, books, etc.

- **AIHUB**: SNS, YouTube comments, book sentences

- **Other**: Namuwiki, Korean Wikipedia

The combined data totals approximately 11GB.
|
|
## Training Details

- **BATCH_SIZE**: 112 (per GPU)

- **ACCUMULATE**: 36

- **Total_BATCH_SIZE**: 8,064

- **MAX_STEPS**: 12,500

- **TRAIN_STEPS × Total_BATCH_SIZE**: ~**100M**

- **WARMUP_STEPS**: 2,400

- **Optimizer**: AdamW, LR 1e-3, betas (0.9, 0.98), eps 1e-6

- **LR schedule**: linear decay

- **Hardware**: 2x RTX 8000 GPUs
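
The card does not include the training script itself; the sketch below only shows how the hyperparameters above would map onto `transformers` `TrainingArguments`, assuming the `Trainer` API was used (dataset, data collator, and tokenizer setup are omitted):

```python
from transformers import TrainingArguments

# Hypothetical mapping of the listed hyperparameters onto TrainingArguments.
# 112 per GPU x 36 accumulation steps x 2 GPUs = 8,064 effective batch size.
training_args = TrainingArguments(
    output_dir="korean_syllable_roberta",
    per_device_train_batch_size=112,
    gradient_accumulation_steps=36,
    max_steps=12_500,
    warmup_steps=2_400,
    learning_rate=1e-3,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    lr_scheduler_type="linear",
)
```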
|
|
|
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/TPSI6kksBLzcbloDCUgwc.png)
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/z3_zVWsGsyT7YD9Zr9aeK.png)
|
|
|
## Usage

Because the tokenizer works at the syllable level rather than with WordPiece, you must use SyllableTokenizer instead of AutoTokenizer. Copy the syllabletokenizer.py provided in this repo and import it.
|
```python
from transformers import AutoModelForMaskedLM

from syllabletokenizer import SyllableTokenizer

# Load the model and the syllable-level tokenizer.
# NOTE: tokenizer_kwargs holds whatever extra settings syllabletokenizer.py
# expects (e.g. special tokens); the values below are assumptions, adjust as needed.
tokenizer_kwargs = {"cls_token": "[CLS]", "sep_token": "[SEP]", "pad_token": "[PAD]",
                    "unk_token": "[UNK]", "mask_token": "[MASK]"}
model = AutoModelForMaskedLM.from_pretrained("Trofish/korean_syllable_roberta")
tokenizer = SyllableTokenizer(vocab_file="vocab.json", **tokenizer_kwargs)

# Convert text to tokens and run a forward pass
inputs = tokenizer("여기에 한국어 텍스트 입력", return_tensors="pt")
outputs = model(**inputs)
```
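
As a follow-up, the masked-LM head can be queried for a prediction at a masked position. The sketch below assumes SyllableTokenizer exposes the standard PreTrainedTokenizer interface (`mask_token`, `mask_token_id`, `decode`); adjust if the repo's tokenizer differs:

```python
import torch

# Mask one syllable and let the model fill it in.
masked_text = f"한국{tokenizer.mask_token} 텍스트"
inputs = tokenizer(masked_text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and decode its top prediction
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```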