# RoBERTa-base Korean

## λͺ¨λΈ μ„€λͺ…
이 RoBERTa λͺ¨λΈμ€ λ‹€μ–‘ν•œ ν•œκ΅­μ–΄ ν…μŠ€νŠΈ λ°μ΄ν„°μ…‹μ—μ„œ **음절** λ‹¨μœ„λ‘œ 사전 ν•™μŠ΅λ˜μ—ˆμŠ΅λ‹ˆλ‹€.
자체 κ΅¬μΆ•ν•œ ν•œκ΅­μ–΄ 음절 λ‹¨μœ„ vocab을 μ‚¬μš©ν•˜μ˜€μŠ΅λ‹ˆλ‹€.
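For intuition, syllable-level tokenization of Hangul is roughly character-level segmentation, since each precomposed syllable is a single code point. A minimal sketch (pure Python, independent of the actual vocab file):

```python
# Illustrative only: precomposed Hangul syllables occupy U+AC00..U+D7A3,
# so syllable-level tokenization of Korean text is roughly per-character.
def syllable_tokens(text: str) -> list[str]:
    return [ch for ch in text if "\uAC00" <= ch <= "\uD7A3"]

print(syllable_tokens("ν•œκ΅­μ–΄ 음절"))  # ['ν•œ', 'κ΅­', 'μ–΄', '음', 'μ ˆ']
```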

## μ•„ν‚€ν…μ²˜
- **λͺ¨λΈ μœ ν˜•**: RoBERTa
- **μ•„ν‚€ν…μ²˜**: RobertaForMaskedLM
- **λͺ¨λΈ 크기**: 256 hidden size, 8 hidden layers, 8 attention heads
- **max_position_embeddings**: 514
- **intermediate_size**: 2048
- **vocab_size**: 1428
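
These hyperparameters map onto a Hugging Face `RobertaConfig` roughly as follows. A minimal sketch: the listed values fill the config, and every other field falls back to the library defaults, which may differ from the authors' exact setup.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Reconstruct the architecture from the values listed above;
# unlisted fields use the transformers library defaults.
config = RobertaConfig(
    vocab_size=1428,
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)
print(model.num_parameters())  # rough sanity check of the model size
```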

## Training Data
The following datasets were used:
- **Modu Corpus (λͺ¨λ‘μ˜λ§λ­‰μΉ˜)**: chat, message boards, everyday conversation, news, broadcast scripts, books, etc.
- **AIHUB**: SNS posts, YouTube comments, book sentences
- **Other**: Namuwiki, Korean Wikipedia

The combined data totals about 11 GB.

## Training Details
- **BATCH_SIZE**: 112 (per GPU)
- **ACCUMULATE**: 36
- **MAX_STEPS**: 12,500
- **Train steps Γ— batch size**: **100M** (12,500 steps Γ— 112 Γ— 36 accumulation Γ— 2 GPUs β‰ˆ 100M samples)
- **Optimizer**: AdamW, LR 1e-3, betas (0.9, 0.98), eps 1e-6
- **LR decay**: linear
- **Hardware**: 2x RTX 8000 GPUs


![Evaluation Loss Graph](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/-64jKdcJAavwgUREwaywe.png)

![Evaluation Accuracy Graph](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/LPq5M6S8LTwkFSCepD33S.png)

## Usage
```python
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer (replace the placeholder with the actual repository ID;
# the tokenizer ships with the same repository)
model = AutoModel.from_pretrained("your_model_name")
tokenizer = AutoTokenizer.from_pretrained("your_model_name")

# Tokenize Korean text and run a forward pass to get encoder outputs
inputs = tokenizer("여기에 ν•œκ΅­μ–΄ ν…μŠ€νŠΈ μž…λ ₯", return_tensors="pt")
outputs = model(**inputs)
```
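
Since the checkpoint was pretrained as a masked language model, it can also be exercised through the fill-mask pipeline. A minimal sketch, reusing the placeholder repository ID above and assuming the tokenizer defines the standard mask token; note that a syllable-level vocabulary predicts one syllable per mask.

```python
from transformers import pipeline

# Placeholder repository ID; substitute the actual model name.
fill = pipeline("fill-mask", model="your_model_name")

# The masked position is filled one syllable at a time,
# since the vocabulary is syllable-level.
text = f"λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œ{fill.tokenizer.mask_token}μž…λ‹ˆλ‹€."
for pred in fill(text):
    print(pred["token_str"], pred["score"])
```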