---
license: apache-2.0
language:
- ko
base_model:
- monologg/koelectra-small-v3-discriminator
library_name: transformers
---
# KoELECTRA-small-v3-privacy-ner
This model is a fine-tuned version of [monologg/koelectra-small-v3-discriminator](https://huggingface.co/monologg/koelectra-small-v3-discriminator) on a synthesized Korean privacy (PII) dataset. It achieves the following results on the evaluation set:
- f1 = 0.9998728608843798
- loss = 0.05310981854414328
- precision = 0.9999237126509853
- recall = 0.9998220142897098
## Model description
Tagging scheme: BIO (see the sketch after this list)
- B (begin): the token begins a named entity
- I (inside): the token falls inside a named entity
- O (outside): the token is not part of any named entity
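For example, under this scheme a person name opens with PER-B, while a multi-token address opens with LOC-B and continues with LOC-I. A minimal sketch with a hypothetical tokenization (the TAG-B / TAG-I suffix convention matches the example output further down):

```python
# Hypothetical tokenization of "홍길동 씨는 서울 특별시 강남구" and its BIO tags.
tokens = ["홍길동", "씨", "는", "서울", "##특별시", "강남구"]
tags   = ["PER-B", "O", "O", "LOC-B", "LOC-I", "LOC-I"]
```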
The tag set covers 12 Korean personal-information patterns:

| Category | Tag | Definition |
|---|---|---|
| PERSON | PER | Korean name |
| LOCATION | LOC | Korean address |
| RESIDENT REGISTRATION NUMBER | RRN | Korean resident registration number |
| EMAIL | EMA | Email address |
| ID | ID | General login ID |
| PASSWORD | PWD | General login password |
| ORGANIZATION | ORG | Affiliated organization |
| PHONE NUMBER | PHN | Phone number |
| CARD NUMBER | CRD | Card number |
| ACCOUNT NUMBER | ACC | Account number |
| PASSPORT NUMBER | PSP | Passport number |
| DRIVER'S LICENSE NUMBER | DLN | Driver's license number |
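Combined with the BIO scheme, this yields 25 labels in total: O plus a B and an I variant for each of the 12 tags. A minimal sketch of building that label list (variable names are my own, not taken from the training code):

```python
# The 12 tags from the table above.
TAGS = ["PER", "LOC", "RRN", "EMA", "ID", "PWD",
        "ORG", "PHN", "CRD", "ACC", "PSP", "DLN"]

# "O" plus TAG-B / TAG-I for every tag: 1 + 12 * 2 = 25 labels.
labels = ["O"] + [f"{tag}-{pos}" for tag in TAGS for pos in ("B", "I")]
id2label = dict(enumerate(labels))
```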
## How to use
You can use this model with the Transformers `pipeline` for NER:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("amoeba04/test1")
model = AutoModelForTokenClassification.from_pretrained("amoeba04/test1")

ner = pipeline("ner", model=model, tokenizer=tokenizer)

# "Last week, Mr. Hong Gildong attended an IT conference held at the
# Teheran-ro 101 Building located in Gangnam-gu, Seoul."
example = "지난주, 홍길동 씨는 서울특별시 강남구에 위치한 테헤란로 101빌딩에서 진행된 IT 컨퍼런스에 참석했습니다."

ner_results = ner(example)
print(ner_results)
```
Output (entity tokens replaced by their predicted tags):

```
"PER-B, PER-B 씨는 LOC-B LOC-I LOC-I LOC-I LOC-I LOC-I LOC-I LOC-I LOC-I LOC-I LOC-I LOC-I에서 진행된 IT 컨퍼런스에 참석했습니다."
```
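The raw pipeline returns one dict per subword token, with `entity`, `word`, `start`, `end`, and `score` keys. Below is a minimal post-processing sketch that stitches those tokens back into entity spans, assuming the TAG-B / TAG-I suffix convention above (`group_spans` is a hypothetical helper, not part of this model):

```python
def group_spans(text, token_results):
    """Merge token-level TAG-B / TAG-I predictions into (tag, text span) pairs."""
    spans, current = [], None
    for tok in token_results:
        tag, pos = tok["entity"].rsplit("-", 1)  # e.g. "PER-B" -> ("PER", "B")
        if pos == "B" or current is None or current[0] != tag:
            current = [tag, tok["start"], tok["end"]]  # open a new span
            spans.append(current)
        else:
            current[2] = tok["end"]  # extend the running span
    return [(tag, text[start:end]) for tag, start, end in spans]

print(group_spans(example, ner_results))
```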
## Training and evaluation data
A named-entity recognition (NER) dataset built in-house from synthesized Korean personal-information patterns.
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 512
- eval_batch_size: 1024
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
- mixed_precision_training: Native AMP
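For reference, a hypothetical `TrainingArguments` sketch mirroring these settings (the output directory name is made up; this is not the authors' actual training script):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="koelectra-small-v3-privacy-ner",  # hypothetical path
    learning_rate=5e-5,
    per_device_train_batch_size=512,
    per_device_eval_batch_size=1024,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1,
    fp16=True,  # Native AMP mixed-precision training
)
# Trainer's default AdamW already uses betas=(0.9, 0.999) and eps=1e-8.
```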
## Framework versions
- Transformers 4.40.0
- Pytorch 2.2.1+cu118
- Datasets 2.19.0
- Tokenizers 0.19.1