---
license: mit
language:
- ko
metrics:
- accuracy
---
|
# Model Card for KorSciDeBERTa

<!-- Provide a quick summary of what the model is/does. -->

KorSciDeBERTa is a Korean pre-trained language model based on the Microsoft DeBERTa architecture, pre-trained on a total of 146 GB of text comprising academic papers, research reports, patents, news articles, and Korean Wikipedia.

The pre-trained model can be used for masked language modeling or next-sentence prediction, and it can be fine-tuned for downstream tasks such as sentence classification, token classification, or question answering.
|
|
|
## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** KISTI (Korea Institute of Science and Technology Information)
- **Model type:** deberta-v2
- **Language(s) (NLP):** Korean (ko)
|
|
|
### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository 1:** https://huggingface.co/kisti/korscideberta
- **Repository 2:** https://aida.kisti.re.kr/
|
|
|
## Uses

### Downstream Use

### Load Huggingface model directly

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

1. Install the morphological analyzer (Mecab) and other prerequisites; see the setup guide KorSciDeBERTa 환경설치+파인튜닝.pdf (environment setup and fine-tuning).

- Mecab installation: see the 'Usage' (사용방법) section at https://aida.kisti.re.kr/model/9bbabd2d-6ce8-44cc-b2a3-69578d23970a

- If the error `SetuptoolsDeprecationWarning: Invalid version: '0.996/ko-0.9.2'` occurs, see https://datanavigator.tistory.com/54

- When using Colab, install Mecab as follows (if the user dictionary mentioned above is not additionally installed, baseline accuracy drops to 0.786):
|
|
|
<pre><code>
# Install Mecab-ko on Google Colab (light install, without the user dictionary)
!git clone https://github.com/SOMJANG/Mecab-ko-for-Google-Colab.git
%cd Mecab-ko-for-Google-Colab/
!bash install_mecab-ko_on_colab_light_220429.sh
</code></pre>
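After the installer finishes, you can quickly verify the Mecab installation through konlpy (a minimal sketch; it assumes konlpy is installed, and the test sentence is arbitrary):

<pre><code>
!pip install konlpy
from konlpy.tag import Mecab

mecab = Mecab()  # raises an exception if the Mecab dictionaries are missing
print(mecab.morphs("한국어 형태소 분석 테스트"))  # expect a list of morphemes
</code></pre>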
|
|
|
- If `ImportError: accelerate>=0.20.1` occurs, run the following and then restart the runtime:

<pre><code>
!pip install -U accelerate; pip install -U transformers; pip install pydantic==1.8
</code></pre>
|
|
|
- If the tokenizer fails to load:

Check that git-lfs is installed and that spm.model was downloaded correctly, with a size of about 2.74 MB (apt-get install git git-lfs).

Make sure you have git-lfs installed (git lfs install).

2. apt-get install git-lfs; git clone https://huggingface.co/kisti/korscideberta; cd korscideberta
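After cloning, it is worth confirming that git-lfs actually pulled the tokenizer model rather than a small pointer file (a quick check; the expected size comes from the note above):

<pre><code>
# Sanity check: spm.model should be about 2.74 MB, not a tiny LFS pointer file
ls -lh spm.model
</code></pre>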
|
|
|
- **korscideberta-abstractcls.ipynb**
|
|
|
<pre><code>
!pip install transformers==4.36.0
from tokenization_korscideberta_v2 import DebertaV2Tokenizer
from transformers import AutoModelForSequenceClassification

tokenizer = DebertaV2Tokenizer.from_pretrained("kisti/korscideberta")
model = AutoModelForSequenceClassification.from_pretrained("kisti/korscideberta", num_labels=7, hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1)
#model = AutoModelForMaskedLM.from_pretrained("kisti/korscideberta")

# ... (dataset preparation and Trainer construction are omitted here) ...
train_metrics = trainer.train().metrics
trainer.save_metrics("train", train_metrics)
trainer.push_to_hub()
</code></pre>
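For reference, a minimal sketch of the elided Trainer setup might look like the following; the dataset objects, batch size, epochs, and learning rate are illustrative assumptions, not values from the original notebook:

<pre><code>
# Illustrative Trainer wiring only; train_dataset / eval_dataset are assumed to be
# tokenized datasets.Dataset objects with a "labels" column for the 7 classes.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="korscideberta-abstractcls",
    per_device_train_batch_size=16,  # assumption
    learning_rate=2e-5,              # assumption
    num_train_epochs=3,              # assumption
    evaluation_strategy="epoch",
    push_to_hub=True,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
</code></pre>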
|
|
|
### KorSciDeBERTa native code

See KorSciDeBERTa 환경설치+파인튜닝.pdf (environment setup and fine-tuning).

- Fine-tuning (based on the original DeBERTa): https://github.com/microsoft/DeBERTa/tree/master#run-deberta-experiments-from-command-line
|
|
|
- **korscideberta.zip**
|
|
|
<pre><code>
apt-get install git git-lfs
git clone https://huggingface.co/kisti/korscideberta; cd korscideberta; unzip korscideberta.zip -d korscideberta
# ... (environment setup steps are omitted here) ...
cd korscideberta/experiments/glue; chmod 777 *.sh;
./mnli.sh
</code></pre>
|
|
|
### Installing KorSciDeBERTa with pip

The code above has to sit in the folder where it is used and be imported from there, which can be inconvenient. To allow installation into a virtual environment (e.g. a conda environment) with pip, a pyproject.toml is provided, and the imports of normalize.py and unicode.py inside tokenization.py are prefixed with "korscideberta." so that everything can be used by importing the korscideberta package.

Running the following from the ./ location installs the package into the current conda environment under the name korscideberta.

The dependencies declared in pyproject.toml specify the required packages and versions (e.g. sentencepiece, mecab, konlpy), which are checked and installed together.

<pre><code>
$ pip install .
</code></pre>
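For orientation, such a pyproject.toml might look roughly like this; the package metadata and dependency names here are hypothetical, and the file shipped in the repository is authoritative:

<pre><code>
# Hypothetical sketch; see the repository's pyproject.toml for the real contents.
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "korscideberta"
version = "1.0.0"
dependencies = [
    "sentencepiece",
    "konlpy",
    "mecab-python3",  # assumed Mecab binding; the actual dependency may differ
]
</code></pre>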
|
|
|
After installation, it can be imported and used as follows.
|
|
|
<pre><code>
import korscideberta

# path is the model location, e.g. "kisti/korscideberta" or a local checkpoint directory
tokenizer = korscideberta.tokenization_korscideberta_v2.DebertaV2Tokenizer.from_pretrained(path)
</code></pre>
|
|
|
### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

This model should not be used to intentionally create hostile or alienating environments for people.

This model is not suitable for use in high-risk settings. It was not designed to make critical decisions about people or things, and its outputs may not be factual.

High-risk settings include the following:

use in the medical, political, legal, or financial domains; evaluating people for employment, education, or credit; automating consequential decisions; generating (fake) facts; producing summaries that must be highly reliable; generating predictions that must always be correct; and so on.
|
|
|
## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Only corpus data free of copyright issues was used, for research purposes. Users of this model should be aware of the risk factors below.

Although the corpora used are largely neutral in character, as with any language model the outputs may include some of the following ethically sensitive elements:

over- or under-representation of particular viewpoints; stereotypes; personal information; hateful, insulting, or violent language; discriminatory or prejudiced language; irrelevant or repetitive outputs; and so on.
|
|
|
|
|
## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

A total of 146 GB of Korean text: academic papers, research reports, patents, news articles, and Korean Wikipedia.
|
|
|
### Training Procedure

Trained for 1,600,000 steps over 2.5 months on 24 NVIDIA A100 80 GB GPUs of the KISTI HPC system.
|
|
|
#### Preprocessing

- Science and technology domain tokenizer (KorSci Tokenizer)
- The corpus was preprocessed with a tokenizer that merges the [Mecab-ko Tokenizer](https://bitbucket.org/eunjeon/mecab-ko/src/master/), extended with a user dictionary of about 6 million nouns and compound nouns built from the pre-training corpus, with the existing SentencePiece-BPE.
- Total: 128,100 tokens
- Included special tokens: `<unk>`, `<cls>`, `<s>`, `<mask>`
- File names: spm.model, vocab.txt
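As a quick illustration of the merged tokenizer, the released files can be exercised directly (a sketch; the sample sentence is arbitrary, and tokenization_korscideberta_v2.py plus spm.model must be present in the working directory):

<pre><code>
from tokenization_korscideberta_v2 import DebertaV2Tokenizer

tokenizer = DebertaV2Tokenizer.from_pretrained("kisti/korscideberta")
print(tokenizer.tokenize("한국어 과학기술 논문 말뭉치를 사전학습에 사용하였다"))
</code></pre>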
|
|
|
#### Training Hyperparameters

- **model_type:** deberta-v2
- **model_size:** base
- **parameters:** 180M
- **hidden_size:** 768
- **num_hidden_layers:** 12
- **num_attention_heads:** 12
- **num_train_steps:** 1,600,000
- **train_batch_size:** 4,096 × 4 gradient-accumulation steps = 16,384
- **learning_rate:** 1e-4
- **max_seq_length:** 512
- **vocab_size:** 128,100
- **Training regime:** fp16 mixed precision <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
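These values correspond to a Hugging Face DebertaV2Config along the following lines (a sketch; only the values listed above come from this card, and max_seq_length is mapped to max_position_embeddings):

<pre><code>
from transformers import DebertaV2Config

config = DebertaV2Config(
    vocab_size=128100,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,
)
</code></pre>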
|
|
|
|
|
## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Data Card if possible. -->

The model was evaluated by fine-tuning it on a paper research-field classification dataset; the results are given below.

- Paper research-field classification dataset (doi.org/10.23057/50): 30,000 papers; number of categories: 33 top-level, 372 mid-level, and 2,898 fine-grained.
|
|
|
#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

F1-micro/macro: a prediction counts as successful if at least one of the top-3 gold labels is predicted.

F1-strict: success is credited in proportion to how many of the top-3 gold labels are predicted.
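One plausible reading of these criteria, as a per-sample sketch (the card does not spell out the exact protocol, so the matching rules below are assumptions):

<pre><code>
# Assumed reading: each sample has up to three gold labels ("Top3") and a set of
# predicted labels; lenient credit needs any overlap, strict credit is proportional.
def lenient_hit(gold_top3, predicted):
    return len(set(gold_top3) & set(predicted)) > 0  # basis for F1-micro/macro

def strict_credit(gold_top3, predicted):
    return len(set(gold_top3) & set(predicted)) / len(gold_top3)  # basis for F1-strict

print(lenient_hit(["A", "B", "C"], ["B"]))         # True
print(strict_credit(["A", "B", "C"], ["B", "C"]))  # 0.666...
</code></pre>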
|
|
|
### Results

F1-micro: 0.85, F1-macro: 0.52, F1-strict: 0.71
|
|
|
|
|
|
|
## Technical Specifications

### Model Objective

Masked language modeling (MLM) is a technique in which some tokens of a tokenized sample are replaced with the `<mask>` token and the model is trained to predict what should appear in place of each `<mask>`, gradually learning about the data. MLM teaches the model the relationships between words.

E.g., suppose you have the sentence 'Deep Learning is so cool! I love neural networks.' and replace a few words with the `<mask>` token.

Masked sentence: 'Deep Learning is so `<mask>`! I love `<mask>` networks.'
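A minimal masked-prediction sketch with this model might look as follows (it assumes the repository's tokenization_korscideberta_v2.py is on the path; the Korean test sentence is arbitrary):

<pre><code>
import torch
from tokenization_korscideberta_v2 import DebertaV2Tokenizer
from transformers import AutoModelForMaskedLM

tokenizer = DebertaV2Tokenizer.from_pretrained("kisti/korscideberta")
model = AutoModelForMaskedLM.from_pretrained("kisti/korscideberta")

text = f"딥러닝은 정말 {tokenizer.mask_token}!"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 candidate tokens for the masked position
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top5 = logits[0, mask_pos[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))
</code></pre>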
|
|
|
### Compute Infrastructure

KISTI National Supercomputing Center NEURON system: HPE ClusterStor E1000, HP Apollo 6500 Gen10 Plus, Lustre, Slurm, CentOS 7.9.
|
|
|
#### Hardware

24× NVIDIA A100 80 GB GPUs
|
|
|
#### Software

Python 3.8, CUDA 10.2, PyTorch 1.10
|
|
|
## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

Korea Institute of Science and Technology Information (2023). Korean DeBERTa pre-trained model for the science and technology domain (KorSciDeBERTa). Version 1.0. Korea Institute of Science and Technology Information (KISTI).
|
|
|
|
|
## Model Card Authors

김성찬, 김경민, 김은희, 이민호, 이승우. Artificial Intelligence Data Research Group, Korea Institute of Science and Technology Information (KISTI)
|
|
|
## Model Card Contact

김성찬, sckim kisti.re.kr

김경민