
RoBERTa-base Korean

๋ชจ๋ธ ์„ค๋ช…

์ด RoBERTa ๋ชจ๋ธ์€ ๋‹ค์–‘ํ•œ ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์Œ์ ˆ ๋‹จ์œ„๋กœ ์‚ฌ์ „ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ž์ฒด ๊ตฌ์ถ•ํ•œ ํ•œ๊ตญ์–ด ์Œ์ ˆ ๋‹จ์œ„ vocab์„ ์‚ฌ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค.
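As a rough illustration (a toy sketch, not the repository's actual SyllableTokenizer), syllable-level tokenization treats each Hangul syllable block as its own token, in contrast to wordpiece sub-words:

# Toy illustration of syllable-level tokenization: every Hangul
# syllable block becomes one token (whitespace is dropped here).
text = "한국어 텍스트"
syllable_tokens = [ch for ch in text if not ch.isspace()]
print(syllable_tokens)  # ['한', '국', '어', '텍', '스', '트']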

Architecture

The model configuration is as follows (a configuration sketch appears after this list):

  • Model type: RoBERTa
  • Architecture: RobertaForMaskedLM
  • Model size: 256 hidden size, 8 hidden layers, 8 attention heads
  • max_position_embeddings: 514
  • intermediate_size: 2048
  • vocab_size: 1428
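A minimal sketch of how these values map onto a Hugging Face RobertaConfig; any field not listed above is assumed to take the transformers default:

from transformers import RobertaConfig, RobertaForMaskedLM

# Configuration mirroring the hyperparameters listed above;
# unlisted fields are left at the transformers defaults.
config = RobertaConfig(
    vocab_size=1428,
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=514,
)

# Randomly initialized model with this shape; use
# from_pretrained("Trofish/korean_syllable_roberta") for the released weights.
model = RobertaForMaskedLM(config)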

Training Data

The following datasets were used:

  • Modu Corpus (모두의말뭉치): chat, message boards, everyday conversation, news, broadcast scripts, books, etc.
  • AIHUB: SNS, YouTube comments, book sentences
  • Other: Namuwiki, Korean Wikipedia

The combined data totals approximately 11 GB.

Training Details

The main pre-training settings were as follows (a training-arguments sketch appears after this list):

  • BATCH_SIZE: 112 (per GPU)
  • ACCUMULATE: 36
  • MAX_STEPS: 12,500
  • Train steps × effective batch size: ~100M sequences (12,500 × 112 × 36 × 2 GPUs ≈ 100.8M)
  • WARMUP_STEPS: 2,400
  • Optimizer: AdamW, LR 1e-3, betas (0.9, 0.98), eps 1e-6
  • Learning rate decay: linear
  • Hardware: 2x RTX 8000 GPUs
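A minimal sketch of these settings expressed as Hugging Face TrainingArguments; this is an approximation of the recipe above, not the authors' actual training script (output_dir is a placeholder, and AdamW with linear decay is the Trainer default):

from transformers import TrainingArguments

# Approximate reproduction of the listed settings.
training_args = TrainingArguments(
    output_dir="./korean_syllable_roberta",  # placeholder path
    per_device_train_batch_size=112,         # BATCH_SIZE (per GPU)
    gradient_accumulation_steps=36,          # ACCUMULATE
    max_steps=12_500,                        # MAX_STEPS
    warmup_steps=2_400,                      # WARMUP_STEPS
    learning_rate=1e-3,                      # AdamW LR
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    lr_scheduler_type="linear",              # linear learning-rate decay
)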

Evaluation Loss Graph

Evaluation Accuracy Graph

Usage

Because the tokenizer works at the syllable level rather than with wordpieces, you must use SyllableTokenizer instead of AutoTokenizer.

(You need to fetch syllabletokenizer.py, provided in this repository, and import it.)

from transformers import AutoModelForMaskedLM
from syllabletokenizer import SyllableTokenizer

# Extra tokenizer settings (special tokens, etc.); example values, adjust to match the repo's vocab
tokenizer_kwargs = {"unk_token": "[UNK]", "pad_token": "[PAD]", "cls_token": "[CLS]",
                    "sep_token": "[SEP]", "mask_token": "[MASK]"}

# Load the model and the syllable-level tokenizer
model = AutoModelForMaskedLM.from_pretrained("Trofish/korean_syllable_roberta")
tokenizer = SyllableTokenizer(vocab_file='vocab.json', **tokenizer_kwargs)

# Convert text to tokens and run a forward pass
inputs = tokenizer("여기에 한국어 텍스트 입력", return_tensors="pt")
outputs = model(**inputs)
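To inspect actual masked-token predictions, the logits can be decoded back through the tokenizer. A small sketch, assuming the repo's tokenizer exposes a [MASK] special token, a mask_token_id, and the standard convert_ids_to_tokens method; if [MASK] is not handled inside raw text, insert the mask id into input_ids manually:

import torch

# Predict the token at a masked position.
masked = tokenizer("서울은 한국의 수도[MASK]다.", return_tensors="pt")
with torch.no_grad():
    logits = model(**masked).logits

# Positions of the [MASK] token in the input sequence.
mask_positions = (masked["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))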