# RoBERTa-base Korean
## ๋ชจ๋ธ ์„ค๋ช…
์ด RoBERTa ๋ชจ๋ธ์€ ๋‹ค์–‘ํ•œ ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์—์„œ **์Œ์ ˆ** ๋‹จ์œ„๋กœ ์‚ฌ์ „ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
์ž์ฒด ๊ตฌ์ถ•ํ•œ ํ•œ๊ตญ์–ด ์Œ์ ˆ ๋‹จ์œ„ vocab์„ ์‚ฌ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค.
## Architecture
- **Model type**: RoBERTa
- **Architecture**: RobertaForMaskedLM
- **Model size**: hidden size 128, 8 hidden layers, 8 attention heads
- **max_position_embeddings**: 514
- **intermediate_size**: 2048
- **vocab_size**: 1428
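
For reference, the settings above correspond roughly to the following `RobertaConfig`; this is a sketch, with all other fields left at the `transformers` defaults.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Architecture settings taken from the list above; remaining fields use library defaults
config = RobertaConfig(
    vocab_size=1428,
    hidden_size=128,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)  # randomly initialised; for illustration only
```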
## Training Data
The following datasets were used:
- **Modu Corpus (모두의말뭉치)**: chat, message boards, everyday conversation, news, broadcast scripts, books, etc.
- **AIHUB**: SNS, YouTube comments, book sentences
- **Other**: Namuwiki, Korean Wikipedia
The combined data amounts to about 11GB.
## Training Details
- **BATCH_SIZE**: 112 (per GPU)
- **ACCUMULATE**: 36
- **Total_BATCH_SIZE**: 8064
- **MAX_STEPS**: 12,500
- **TRAIN_STEPS * BATCH_SIZE**: **100M**
- **WARMUP_STEPS**: 2,400
- **Optimizer**: AdamW, LR 1e-3, BETA (0.9, 0.98), eps 1e-6
- **Learning rate decay**: linear
- **Hardware**: 2x RTX 8000 GPUs
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/TPSI6kksBLzcbloDCUgwc.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/z3_zVWsGsyT7YD9Zr9aeK.png)
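
For reference, the hyperparameters listed above map onto a `transformers` `TrainingArguments` setup roughly as follows. This is a sketch, not the exact training script; `output_dir` is a placeholder.

```python
from transformers import TrainingArguments

# Pre-training hyperparameters from the list above
# (112 per GPU x 36 accumulation steps x 2 GPUs = total batch size 8064)
training_args = TrainingArguments(
    output_dir="./korean_syllable_roberta",  # placeholder path
    per_device_train_batch_size=112,         # BATCH_SIZE per GPU
    gradient_accumulation_steps=36,          # ACCUMULATE
    max_steps=12_500,                        # MAX_STEPS
    warmup_steps=2_400,                      # WARMUP_STEPS
    learning_rate=1e-3,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    lr_scheduler_type="linear",
)
```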
## Usage
Because the tokenizer works at the syllable level rather than with WordPiece, you must use `SyllableTokenizer` instead of `AutoTokenizer`. Import it from the `syllabletokenizer.py` provided in this repository.
```python
from transformers import AutoModelForMaskedLM
from syllabletokenizer import SyllableTokenizer

# Load the model and the syllable-level tokenizer
# (tokenizer_kwargs holds the special-token settings expected by this repository's syllabletokenizer.py)
model = AutoModelForMaskedLM.from_pretrained("Trofish/korean_syllable_roberta")
tokenizer = SyllableTokenizer(vocab_file='vocab.json', **tokenizer_kwargs)

# Convert the text to tokens and run a forward pass
inputs = tokenizer("여기에 한국어 텍스트 입력", return_tensors="pt")
outputs = model(**inputs)
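
# The forward pass returns masked-LM logits over the syllable vocabulary.
# A minimal way to inspect them (a sketch; it assumes SyllableTokenizer follows the
# standard PreTrainedTokenizer interface, e.g. convert_ids_to_tokens):
import torch

logits = outputs.logits                       # shape: (batch, seq_len, vocab_size)
predicted_ids = torch.argmax(logits, dim=-1)  # most likely vocab id per position
print(tokenizer.convert_ids_to_tokens(predicted_ids[0].tolist()))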