RoBERTa-base Korean
Model Description
This RoBERTa model was pretrained at the syllable level on a variety of Korean text datasets, using a custom-built Korean syllable-level vocabulary.
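To illustrate what syllable-level modeling means here, the sketch below simply splits a sentence into individual Hangul syllable blocks in plain Python. This is only a conceptual illustration, not the actual tokenizer's behavior, which also handles spaces, special tokens, and non-Hangul characters.

# Conceptual sketch only: syllable-level tokenization treats each Hangul
# syllable block as its own token.
text = "한국어 텍스트"
syllables = [ch for ch in text if ch != " "]
print(syllables)  # ['한', '국', '어', '텍', '스', '트']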
Architecture
- Model type: RoBERTa
- Architecture: RobertaForMaskedLM
- Model size: hidden size 256, 8 hidden layers, 8 attention heads
- max_position_embeddings: 514
- intermediate_size: 2048
- vocab_size: 1428
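These hyperparameters can be assembled into a Hugging Face configuration as sketched below; any value not listed above is left at the RobertaConfig default, so this is an approximation rather than the exact training configuration.

from transformers import RobertaConfig, RobertaForMaskedLM

# Configuration built from the numbers listed above; unspecified fields
# keep their RobertaConfig defaults.
config = RobertaConfig(
    vocab_size=1428,
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")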
Training Data
The following datasets were used:
- Modu Corpus (모두의말뭉치): chat, online forums, everyday conversation, news, broadcast scripts, books, etc.
- AIHub: SNS, YouTube comments, book sentences
- Other: Namuwiki, Korean Wikipedia. The combined data totals roughly 11 GB.
Training Details
- BATCH_SIZE: 112 (per GPU)
- ACCUMULATE: 36
- MAX_STEPS: 12,500
- Train Steps * Batch Size: 100M
- WARMUP_STEPS: 2,400
- Optimizer: AdamW, LR 1e-3, betas (0.9, 0.98), eps 1e-6
- LR schedule: linear decay
- Hardware: 2x RTX 8000 GPUs
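A rough sketch of an optimizer and schedule matching these settings is shown below, assuming PyTorch's AdamW and the transformers linear warmup schedule; data loading and the training loop are omitted, and model refers to the RobertaForMaskedLM instance from the architecture section.

import torch
from transformers import get_linear_schedule_with_warmup

# Optimizer and linear warmup/decay schedule mirroring the values above.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.98), eps=1e-6
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=2_400, num_training_steps=12_500
)
# With a per-GPU batch of 112, 36 accumulation steps, and 2 GPUs, each
# optimizer step covers 112 * 36 * 2 = 8,064 sequences, so 12,500 steps
# correspond to roughly 100M training sequences.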
Usage
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer
model = AutoModel.from_pretrained("your_model_name")
tokenizer = AutoTokenizer.from_pretrained("your_tokenizer_name")

# Tokenize Korean text and run a forward pass
inputs = tokenizer("여기에 한국어 텍스트 입력", return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state holds per-token representations
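Because the architecture is RobertaForMaskedLM, the checkpoint can also be queried for masked-token predictions. The snippet below is a sketch using the transformers fill-mask pipeline, with "your_model_name" still a placeholder for the actual checkpoint and assuming the tokenizer defines a mask token.

from transformers import pipeline

# Sketch of masked-token prediction; with a syllable-level vocab, the
# model fills in a single masked syllable.
fill = pipeline("fill-mask", model="your_model_name")
masked = f"한국어 {fill.tokenizer.mask_token} 모델"
print(fill(masked))  # top candidate syllables with scores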