Trofish committed
Commit 2c1575f • 1 Parent(s): ce87fba

Update README.md

Files changed (1)
  1. README.md +41 -10
README.md CHANGED
@@ -1,10 +1,41 @@
- ---
- license: apache-2.0
- language:
- - ko
- metrics:
- - accuracy
- tags:
- - roberta
- - syllable
- ---
+ # RoBERTa-base Korean
+
+ ## Model description
+ This RoBERTa model was pretrained at the *syllable* level on a variety of Korean text datasets.
+
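To make "syllable level" concrete: each precomposed Hangul syllable block is a single Unicode code point in the range U+AC00 to U+D7A3, so a syllable-level split can be approximated by iterating over characters. The sketch below is only an illustration. The helper names `is_hangul_syllable` and `split_syllables` are hypothetical, and dropping whitespace is an assumption; the README does not show how the model's actual tokenizer handles spaces, punctuation, or non-Hangul text.

```python
def is_hangul_syllable(ch):
    """Return True if ch is a precomposed Hangul syllable (U+AC00..U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"


def split_syllables(text):
    """Split text into one token per character, dropping whitespace.

    Assumption: whitespace is discarded. The real tokenizer may treat it
    differently.
    """
    return [ch for ch in text if not ch.isspace()]


tokens = split_syllables("한국어 텍스트")
print(tokens)
print([is_hangul_syllable(t) for t in tokens])
```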
+ ## Architecture
+ - **Model type**: RoBERTa
+ - **Architecture**: RobertaForMaskedLM
+ - **Model size**: 256 hidden size, 8 hidden layers, 8 attention heads
+ - **max_position_embeddings**: 514
+ - **intermediate_size**: 2048
+ - **vocab_size**: 1428
+
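The architecture values listed above map directly onto fields of `RobertaConfig` in the `transformers` library. The following is a minimal sketch of that mapping, not the author's actual configuration file: any field not listed in the README (dropout, `type_vocab_size`, and so on) is left at the library default here, which is an assumption.

```python
from transformers import RobertaConfig

# Values taken from the README; every other field keeps the library
# default, which may differ from the checkpoint's real config.
config = RobertaConfig(
    vocab_size=1428,
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=514,
)

# The hidden size must divide evenly across the attention heads.
head_dim = config.hidden_size // config.num_attention_heads
print(head_dim)
```

Passing this config to `RobertaForMaskedLM(config)` would build a randomly initialized model of the same shape; loading the trained weights still requires `from_pretrained` with the actual repository ID.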
+ ## Training data
+ The following datasets were used:
+ - **Modu Corpus (모두의말뭉치)**: chat, bulletin boards, everyday conversation, news, broadcast scripts, books, etc.
+ - **AIHUB**: SNS posts, YouTube comments, book sentences
+ - **Other**: Namuwiki, Korean Wikipedia
+ The combined data totals about 11 GB.
+
+ ## Training details
+ - **BATCH_SIZE**: 54 (per GPU)
+ - **ACCUMULATE**: 74
+ - **MAX_STEPS**: 12,500
+ - **Train Steps * Batch Size**: 100M
+ - **WARMUP_STEPS**: 2,400
+ - **Optimizer**: AdamW, LR 1e-3, BETA (0.9, 0.98), eps 1e-6
+ - **Learning rate decay**: linear
+ - **Hardware used**: 2x RTX 8000 GPU
+
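The 100M figure can be checked from the other numbers, assuming `ACCUMULATE` is the number of gradient-accumulation steps, `MAX_STEPS` counts optimizer updates, and both GPUs run data-parallel (the README does not state any of this explicitly). The sketch also includes a linear warmup and linear decay schedule. Decaying to zero exactly at `MAX_STEPS` is a further assumption, and `lr_at` is a hypothetical helper, not code from the training run.

```python
PER_GPU_BATCH = 54
NUM_GPUS = 2
ACCUMULATE = 74       # assumed: gradient-accumulation steps
MAX_STEPS = 12_500    # assumed: optimizer updates
WARMUP_STEPS = 2_400
PEAK_LR = 1e-3

# Sequences consumed per optimizer update, and over the whole run.
effective_batch = PER_GPU_BATCH * NUM_GPUS * ACCUMULATE
total_sequences = effective_batch * MAX_STEPS


def lr_at(step):
    """Linear warmup to PEAK_LR, then linear decay to 0 at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, MAX_STEPS - step)
    return PEAK_LR * remaining / (MAX_STEPS - WARMUP_STEPS)


print(effective_batch)   # 7992
print(total_sequences)   # 99900000
```

Under these assumptions the run sees 99.9M sequences, which matches the README's rounded 100M, and warmup covers roughly the first 19% of training.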
+ ## Usage
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ # Load the model and tokenizer (replace the placeholder names with the actual repository IDs)
+ model = AutoModel.from_pretrained("your_model_name")
+ tokenizer = AutoTokenizer.from_pretrained("your_tokenizer_name")
+
+ # Tokenize the text and run a forward pass
+ inputs = tokenizer("Enter Korean text here", return_tensors="pt")
+ outputs = model(**inputs)