Trofish committed 247ea98 (1 parent: ebcaba5)

Update README.md

Files changed (1): README.md (+46 −3)
README.md CHANGED
@@ -1,3 +1,46 @@
- ---
- license: apache-2.0
- ---
# RoBERTa-base Korean

## Model Description
This RoBERTa model was pretrained at the **syllable** level on a variety of Korean text datasets.
It uses a custom-built Korean syllable-level vocabulary.

## Architecture
- **Model type**: RoBERTa
- **Architecture**: RobertaForMaskedLM
- **Model size**: hidden size 128, 8 hidden layers, 8 attention heads
- **max_position_embeddings**: 514
- **intermediate_size**: 2048
- **vocab_size**: 1428

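From the configuration above, a back-of-the-envelope weight count can be worked out in plain Python. This is a rough sketch: biases, layer norms, token-type embeddings, and the LM head are ignored:

```python
hidden, layers = 128, 8
intermediate, vocab, max_pos = 2048, 1428, 514

# Embedding weights: token embeddings + position embeddings
emb = vocab * hidden + max_pos * hidden   # 248,576

# Per encoder layer: Q/K/V/output projections plus the two FFN matrices
attn = 4 * hidden * hidden                # 65,536
ffn = 2 * hidden * intermediate           # 524,288

total = emb + layers * (attn + ffn)
print(f"~{total / 1e6:.1f}M weights")     # ~5.0M weights
```

The small hidden size (128) keeps the model at only a few million parameters, far below RoBERTa-base's ~125M.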
## Training Data
The following datasets were used:
- **Modu Corpus (모두의말뭉치)**: chat, message boards, everyday conversation, news, broadcast scripts, books, etc.
- **AIHUB**: SNS, YouTube comments, book sentences
- **Other**: Namuwiki, Korean Wikipedia

The combined data totals about 11 GB.

## Training Details
- **BATCH_SIZE**: 112 (per GPU)
- **ACCUMULATE**: 36
- **MAX_STEPS**: 12,500
- **Train steps × batch size**: **100M**
- **WARMUP_STEPS**: 2,400
- **Optimizer**: AdamW, LR 1e-3, betas (0.9, 0.98), eps 1e-6
- **LR schedule**: linear decay
- **Hardware**: 2x RTX 8000 GPUs
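As a sanity check on the 100M figure, the effective batch size implied by these settings (assuming the per-GPU batch is accumulated across both GPUs) multiplies out as follows:

```python
per_gpu_batch = 112
gpus = 2          # 2x RTX 8000
accumulate = 36
max_steps = 12_500

# Sequences consumed per optimizer step, then over the whole run
effective_batch = per_gpu_batch * gpus * accumulate
total_sequences = effective_batch * max_steps

print(effective_batch, total_sequences)  # 8064 100800000 (~100M)
```

So 12,500 optimizer steps at an effective batch of 8,064 sequences comes to roughly 100M training examples, matching the stated figure.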

## Usage
Because the tokenizer operates at the syllable level rather than on WordPiece units, you must use `SyllableTokenizer` instead of `AutoTokenizer`.
(Use the `syllabletokenizer.py` provided in this repository.)
```python
from transformers import AutoModelForMaskedLM
from syllabletokenizer import SyllableTokenizer

# Load the model and the syllable-level tokenizer
model = AutoModelForMaskedLM.from_pretrained("Trofish/korean_syllable_roberta")
tokenizer_kwargs = {}  # any extra keyword arguments required by SyllableTokenizer
tokenizer = SyllableTokenizer(vocab_file="vocab.json", **tokenizer_kwargs)

# Tokenize the text and run a forward pass
inputs = tokenizer("여기에 한국어 텍스트 입력", return_tensors="pt")
outputs = model(**inputs)
```