---
license: apache-2.0
datasets:
- klue/klue
language:
- ko
metrics:
- f1
- accuracy
- pearsonr
---
# RoBERTa-base Korean

## Model Description
This RoBERTa model was pre-trained at the **syllable** level on a variety of Korean text datasets.
It uses a custom-built Korean syllable-level vocabulary.
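
To illustrate what syllable-level tokenization means, here is a minimal sketch that treats each Hangul syllable block as one token. It is not the repo's `SyllableTokenizer`: the function names `split_syllables` and `encode`, the toy vocabulary, the `[UNK]` token, and the choice to drop whitespace are all assumptions made for illustration.

```python
import unicodedata


def split_syllables(text: str) -> list[str]:
    """Return one token per character, skipping whitespace (an assumption)."""
    # NFC normalization composes decomposed jamo into single syllable blocks.
    normalized = unicodedata.normalize("NFC", text)
    return [ch for ch in normalized if not ch.isspace()]


def encode(tokens: list[str], vocab: dict[str, int], unk_token: str = "[UNK]") -> list[int]:
    """Map tokens to ids, falling back to the unknown-token id."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]


# Hypothetical toy vocabulary; the real model ships its own syllable vocab.
toy_vocab = {"[UNK]": 0, "한": 1, "국": 2, "어": 3}

tokens = split_syllables("한국어 텍스트")
ids = encode(tokens, toy_vocab)
print(tokens)
print(ids)
```

Because every syllable block is a single token, the vocabulary can stay very small compared with a WordPiece vocabulary, at the cost of longer token sequences.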

## Architecture
- **Model type**: RoBERTa
- **Architecture**: RobertaForMaskedLM
- **Model size**: hidden size 256, 8 hidden layers, 8 attention heads
- **max_position_embeddings**: 514
- **intermediate_size**: 2,048
- **vocab_size**: 1,428
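
For a rough sense of scale, the encoder's parameter count can be estimated from the values listed above. This is a back-of-the-envelope sketch, not an official figure: it assumes the standard RoBERTa encoder layout and a single token-type embedding (`TYPE_VOCAB = 1`, which this card does not state), and it excludes the masked-LM head.

```python
# Values taken from the architecture list above.
HIDDEN = 256
LAYERS = 8
INTERMEDIATE = 2048
MAX_POSITIONS = 514
VOCAB = 1428
TYPE_VOCAB = 1  # assumption: not stated in this card


def embedding_params() -> int:
    """Word, position, and token-type embeddings plus one LayerNorm."""
    tables = (VOCAB + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN
    layer_norm = 2 * HIDDEN  # weight and bias
    return tables + layer_norm


def layer_params() -> int:
    """One transformer layer: self-attention, feed-forward, two LayerNorms."""
    attention = 4 * (HIDDEN * HIDDEN + HIDDEN)  # Q, K, V, output projections
    feed_forward = (HIDDEN * INTERMEDIATE + INTERMEDIATE) + (INTERMEDIATE * HIDDEN + HIDDEN)
    layer_norms = 2 * (2 * HIDDEN)
    return attention + feed_forward + layer_norms


total = embedding_params() + LAYERS * layer_params()
print(f"approx. encoder parameters: {total:,}")
```

Under these assumptions the estimate comes to roughly 11M parameters, most of them in the feed-forward blocks rather than in the small syllable embedding table.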

## Training Data
The following datasets were used:
- **Modu Corpus (모두의말뭉치)**: chat, message boards, everyday conversation, news, broadcast scripts, books, and more
- **AIHUB**: SNS posts, YouTube comments, book sentences
- **Other**: Namuwiki, Korean Wikipedia

The combined data totals **about 11GB** **(4B tokens)**.

## Training Details
- **BATCH_SIZE**: 112 (per GPU)
- **ACCUMULATE**: 36
- **Total_BATCH_SIZE**: 8,064
- **MAX_STEPS**: 12,500
- **TRAIN_STEPS * BATCH_SIZE**: **100M**
- **WARMUP_STEPS**: 2,400
- **Optimizer**: AdamW, LR 1e-3, BETA (0.9, 0.98), eps 1e-6
- **Learning rate decay**: linear
- **Hardware**: 2x RTX 8000 GPU
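
The batch-size figures above follow from per-GPU batch size × accumulation steps × number of GPUs. The sketch below reproduces that arithmetic and shows one common shape for a linear schedule. It assumes a linear warmup from 0 to the peak LR followed by linear decay to 0 at `MAX_STEPS`, which this card does not spell out; the function name `learning_rate` is illustrative.

```python
BATCH_SIZE_PER_GPU = 112
ACCUMULATE = 36
NUM_GPUS = 2
MAX_STEPS = 12_500
WARMUP_STEPS = 2_400
PEAK_LR = 1e-3

# Sequences consumed per optimizer step across both GPUs.
total_batch_size = BATCH_SIZE_PER_GPU * ACCUMULATE * NUM_GPUS
# Sequences seen over the whole run.
total_sequences = total_batch_size * MAX_STEPS


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, MAX_STEPS - step)
    return PEAK_LR * remaining / (MAX_STEPS - WARMUP_STEPS)


print(total_batch_size)  # 8064
print(total_sequences)   # 100800000
for step in (0, 1_200, 2_400, 7_450, 12_500):
    print(step, learning_rate(step))
```

The product of 8,064 sequences per step and 12,500 steps is about 100.8M sequences, which matches the roughly 100M listed for TRAIN_STEPS * BATCH_SIZE.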


![Evaluation Loss Graph](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/-64jKdcJAavwgUREwaywe.png)
![Evaluation Accuracy Graph](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/LPq5M6S8LTwkFSCepD33S.png)

## Evaluation
- **Performance was evaluated on the KLUE benchmark test.**
- Because the model is much smaller than klue-roberta-base, its scores are lower, but the model with hidden size 512 performed well for its size.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/I8e60cf9w-IQCHDgKiooq.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/hkc5ko9Vo-pkKmtouN7xc.png)


## Usage
The tokenizer works on syllable units rather than WordPiece, so use SyllableTokenizer instead of AutoTokenizer. Import it from the `syllabletokenizer.py` file provided in the repo.

```python
from transformers import AutoModelForMaskedLM
from syllabletokenizer import SyllableTokenizer  # syllabletokenizer.py from the repo

# Optional extra tokenizer arguments; left empty here as a placeholder
tokenizer_kwargs = {}

# Load the model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("Trofish/korean_syllable_roberta")
tokenizer = SyllableTokenizer(vocab_file='vocab.json', **tokenizer_kwargs)

# Convert the text to tokens and run prediction
inputs = tokenizer("여기에 한국어 텍스트 입력", return_tensors="pt")
outputs = model(**inputs)
```

## Citation
**klue**
```
@misc{park2021klue,
      title={KLUE: Korean Language Understanding Evaluation}, 
      author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jungwoo Ha and Kyunghyun Cho},
      year={2021},
      eprint={2105.09680},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```