Science and Technology Domain BERT Pre-trained Model (KorSci BERT)

The KorSci BERT language model is one of the outputs of a research project jointly conducted by the Korea Institute of Science and Technology Information (KISTI) and the Korea Institute of Patent Information (KIPI). Based on the architecture of the original Google BERT-base model, it was pre-trained on a combined Korean paper & patent corpus of 97 GB (about 380 million sentences).

Training dataset

Type    | Corpus size | Sentences   | Avg. sentence length
Papers  | 15 GB       | 72,735,757  | 122.11
Patents | 82 GB       | 316,239,927 | 120.91
Total   | 97 GB       | 388,975,684 | 121.13

Model architecture

  • attention_probs_dropout_prob:0.1
  • directionality:"bidi"
  • hidden_act:"gelu"
  • hidden_dropout_prob:0.1
  • hidden_size:768
  • initializer_range:0.02
  • intermediate_size:3072
  • max_position_embeddings:512
  • num_attention_heads:12
  • num_hidden_layers:12
  • pooler_fc_size:768
  • pooler_num_attention_heads:12
  • pooler_num_fc_layers:3
  • pooler_size_per_head:128
  • pooler_type:"first_token_transform"
  • type_vocab_size:2
  • vocab_size:15330
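
For reference, the configuration above can be reproduced with the Hugging Face transformers library. The snippet below is a minimal sketch, not an official config file: the pooler_* and directionality fields are specific to the original Google BERT checkpoint format and have no BertConfig equivalent.

from transformers import BertConfig

# Mirrors the KorSci BERT architecture listed above. The pooler_* and
# "directionality" entries from the original Google-style config are
# checkpoint metadata, not BertConfig parameters.
korsci_config = BertConfig(
    vocab_size=15330,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02,
)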

Vocabulary

  • Total 15,330 tokens
  • Includes the special tokens [PAD], [UNK], [CLS], [SEP], [MASK]
  • File name: vocab_kisti.txt (a quick sanity-check sketch follows this list)
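
As a sanity check, the vocabulary file can be inspected directly. This is a minimal sketch assuming the standard BERT vocabulary format (one token per line, with the line index serving as the token id):

# Assumes the standard BERT vocab format: one token per line,
# token id = line index.
with open("vocab_kisti.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

print(len(vocab))  # expected: 15330
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
print(all(tok in vocab for tok in special_tokens))  # expected: True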

Language model

  • Model file: model.ckpt-262500 (TensorFlow checkpoint file)
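
To use the checkpoint outside TensorFlow, it can be converted to PyTorch. The snippet below is a minimal sketch using the Hugging Face transformers helpers; it requires tensorflow to be installed, reuses the korsci_config object from the architecture sketch above, and the file paths are assumptions:

import torch
from transformers import BertForPreTraining, load_tf_weights_in_bert

# Build an empty BERT model with the KorSci configuration, then copy
# the weights from the TensorFlow checkpoint into it.
model = BertForPreTraining(korsci_config)
load_tf_weights_in_bert(model, korsci_config, "./model.ckpt-262500")
torch.save(model.state_dict(), "./korsci_bert_pytorch.bin")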

Pre-training

  • Trained for 1,600,000 steps at sequence length 128, followed by 500,000 steps at sequence length 512
  • Trained on the 380 million sentences of the paper + patent (97 GB) corpus
  • Distributed training on 8 NVIDIA V100 32 GB GPUs with the Horovod library
  • Used NVIDIA Automatic Mixed Precision (AMP); a minimal configuration sketch follows this list
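
The pre-training code itself is not part of this card; the snippet below is only a minimal sketch of how Horovod distributed training and NVIDIA AMP are typically wired into a TensorFlow 1.x training setup such as BERT's. The optimizer and learning rate are placeholders, not the values used in the actual run:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU

# Pin each worker process to its own GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder optimizer; the real pre-training hyperparameters are not
# published here.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4 * hvd.size())

# NVIDIA automatic mixed precision graph rewrite (TensorFlow >= 1.14).
optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)

# Average gradients across all workers.
optimizer = hvd.DistributedOptimizer(optimizer)

# Broadcast initial variables from rank 0 so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(root_rank=0)]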

Downstream task evaluation

๋ณธ ์–ธ์–ด๋ชจ๋ธ์˜ ์„ฑ๋Šฅํ‰๊ฐ€๋Š” ๊ณผํ•™๊ธฐ์ˆ ํ‘œ์ค€๋ถ„๋ฅ˜ ๋ฐ ํŠนํ—ˆ ์„ ์ง„ํŠนํ—ˆ๋ถ„๋ฅ˜(CPC) 2๊ฐ€์ง€์˜ ํƒœ์Šคํฌ๋ฅผ ํŒŒ์ธํŠœ๋‹ํ•˜์—ฌ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ๊ทธ ๊ฒฐ๊ณผ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

Type                                          | Classes | Train   | Test   | Metric   | Train result | Test result
Science & technology standard classification | 86      | 130,515 | 14,502 | Accuracy | 68.21        | 70.31
Patent CPC classification                    | 144     | 390,540 | 16,315 | Accuracy | 86.87        | 76.25

๊ณผํ•™๊ธฐ์ˆ ๋ถ„์•ผ ํ† ํฌ๋‚˜์ด์ € (KorSci Tokenizer)

๋ณธ ํ† ํฌ๋‚˜์ด์ €๋Š” ํ•œ๊ตญ๊ณผํ•™๊ธฐ์ˆ ์ •๋ณด์—ฐ๊ตฌ์›๊ณผ ํ•œ๊ตญํŠนํ—ˆ์ •๋ณด์›์ด ๊ณต๋™์œผ๋กœ ์—ฐ๊ตฌํ•œ ๊ณผ์ œ์˜ ๊ฒฐ๊ณผ๋ฌผ ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ , ์œ„ ์‚ฌ์ „ํ•™์Šต ๋ชจ๋ธ์—์„œ ์‚ฌ์šฉ๋œ ์ฝ”ํผ์Šค๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ช…์‚ฌ ๋ฐ ๋ณตํ•ฉ๋ช…์‚ฌ ์•ฝ 600๋งŒ๊ฐœ์˜ ์‚ฌ์šฉ์ž์‚ฌ์ „์ด ์ถ”๊ฐ€๋œ Mecab-ko Tokenizer์™€ ๊ธฐ์กด BERT WordPiece Tokenizer๊ฐ€ ๋ณ‘ํ•ฉ๋˜์–ด์ง„ ํ† ํฌ๋‚˜์ด์ €์ด๋‹ค.

๋ชจ๋ธ ๋‹ค์šด๋กœ๋“œ

http://doi.org/10.23057/46

Requirements

Install eunjeon Mecab (mecab-ko) & add the user dictionaries

Installation URL: https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/
mecab-ko > 0.996-ko-0.9.2
mecab-ko-dic > 2.1.1
mecab-python > 0.996-ko-0.9.2

Paper & patent user dictionaries

  • ๋…ผ๋ฌธ ์‚ฌ์šฉ์ž ์‚ฌ์ „ : pap_all_mecab_dic.csv (1,001,328 words)
  • ํŠนํ—ˆ ์‚ฌ์šฉ์ž ์‚ฌ์ „ : pat_all_mecab_dic.csv (5,000,000 words)

Install konlpy

pip install konlpy
konlpy > 0.5.2
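
After installation, a quick way to confirm that Mecab and konlpy are wired up correctly is to run the morphological analyzer directly. A minimal sketch; the dicpath value is konlpy's usual default location and may need adjusting to wherever mecab-ko-dic was actually installed:

from konlpy.tag import Mecab

# Point dicpath at the compiled mecab-ko-dic directory (with the paper
# and patent user dictionaries added); this path is an assumption.
mecab = Mecab(dicpath="/usr/local/lib/mecab/dic/mecab-ko-dic")
print(mecab.morphs("ํ•ฉ์„ฑ์„ธ์ œ์•ก์„ ๋ฐ€๋ด‰ํ•˜๋Š” ์„ธ์ œ์•กํฌ"))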

Usage

import tokenization_kisti as tokenization

vocab_file = "./vocab_kisti.txt"

# FullTokenizer with tokenizer_type="Mecab" first segments the text into
# morphemes with Mecab-ko, then applies BERT WordPiece to each morpheme.
tokenizer = tokenization.FullTokenizer(
    vocab_file=vocab_file,
    do_lower_case=False,
    tokenizer_type="Mecab"
)

example = "๋ณธ ๊ณ ์•ˆ์€ ์ฃผ๋กœ ์ผํšŒ์šฉ ํ•ฉ์„ฑ์„ธ์ œ์•ก์„ ์ง‘์–ด๋„ฃ์–ด ๋ฐ€๋ด‰ํ•˜๋Š” ์„ธ์ œ์•กํฌ์˜ ๋‚ด๋ถ€๋ฅผ ์›ํ˜ธ์ƒ์œผ๋กœ ์—ด์ค‘์ฐฉํ•˜๋˜ ์„ธ์ œ์•ก์ด ๋ฐฐ์ถœ๋˜๋Š” ์ ˆ๋‹จ๋ถ€ ์ชฝ์œผ๋กœ ๋‚ด๋ฒฝ์„ ํ˜‘์†Œํ•˜๊ฒŒ ํ˜•์„ฑํ•˜์—ฌ์„œ ๋‚ด๋ถ€์— ๋“ค์–ด์žˆ๋Š” ์„ธ์ œ์•ก์„ ์ž˜์งœ์งˆ ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ํ•ฉ์„ฑ์„ธ์ œ ์•กํฌ์— ๊ด€ํ•œ ๊ฒƒ์ด๋‹ค."
tokens = tokenizer.tokenize(example)                              # subword tokens
encoded_tokens = tokenizer.convert_tokens_to_ids(tokens)          # vocabulary ids
decoded_tokens = tokenizer.convert_ids_to_tokens(encoded_tokens)  # ids back to tokens

print("Input example ===>", example)
print("Tokenized example ===>", tokens)
print("Converted example to IDs ===>", encoded_tokens)
print("Converted IDs to example ===>", decoded_tokens)

============ Result ================
Input example ===> ๋ณธ ๊ณ ์•ˆ์€ ์ฃผ๋กœ ์ผํšŒ์šฉ ํ•ฉ์„ฑ์„ธ์ œ์•ก์„ ์ง‘์–ด๋„ฃ์–ด ๋ฐ€๋ด‰ํ•˜๋Š” ์„ธ์ œ์•กํฌ์˜ ๋‚ด๋ถ€๋ฅผ ์›ํ˜ธ์ƒ์œผ๋กœ ์—ด์ค‘์ฐฉํ•˜๋˜ ์„ธ์ œ์•ก์ด ๋ฐฐ์ถœ๋˜๋Š” ์ ˆ๋‹จ๋ถ€ ์ชฝ์œผ๋กœ ๋‚ด๋ฒฝ์„ ํ˜‘์†Œํ•˜๊ฒŒ ํ˜•์„ฑํ•˜์—ฌ์„œ ๋‚ด๋ถ€์— ๋“ค์–ด์žˆ๋Š” ์„ธ์ œ์•ก์„ ์ž˜์งœ์งˆ ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ํ•ฉ์„ฑ์„ธ์ œ ์•กํฌ์— ๊ด€ํ•œ ๊ฒƒ์ด๋‹ค.
Tokenized example ===> ['๋ณธ', '๊ณ ์•ˆ', '์€', '์ฃผ๋กœ', '์ผํšŒ์šฉ', 'ํ•ฉ์„ฑ', '##์„ธ', '##์ œ', '##์•ก', '์„', '์ง‘', '##์–ด', '##๋„ฃ', '์–ด', '๋ฐ€๋ด‰', 'ํ•˜', '๋Š”', '์„ธ์ œ', '##์•ก', '##ํฌ', '์˜', '๋‚ด๋ถ€', '๋ฅผ', '์›ํ˜ธ', '์ƒ', '์œผ๋กœ', '์—ด', '##์ค‘', '์ฐฉ', '##ํ•˜', '๋˜', '์„ธ์ œ', '##์•ก', '์ด', '๋ฐฐ์ถœ', '๋˜', '๋Š”', '์ ˆ๋‹จ๋ถ€', '์ชฝ', '์œผ๋กœ', '๋‚ด๋ฒฝ', '์„', 'ํ˜‘', '##์†Œ', 'ํ•˜', '๊ฒŒ', 'ํ˜•์„ฑ', 'ํ•˜', '์—ฌ์„œ', '๋‚ด๋ถ€', '์—', '๋“ค', '์–ด', '์žˆ', '๋Š”', '์„ธ์ œ', '##์•ก', '์„', '์ž˜', '์งœ', '์งˆ', '์ˆ˜', '์žˆ', '๋„๋ก', 'ํ•˜', '๋Š”', 'ํ•ฉ์„ฑ', '##์„ธ', '##์ œ', '์•ก', '##ํฌ', '์—', '๊ด€ํ•œ', '๊ฒƒ', '์ด', '๋‹ค', '.']
Converted example to IDs ===> [59, 619, 30, 2336, 8268, 819, 14100, 13986, 14198, 15, 732, 13994, 14615, 39, 1964, 12, 11, 6174, 14198, 14061, 9, 366, 16, 7267, 18, 32, 307, 14072, 891, 13967, 27, 6174, 14198, 14, 698, 27, 11, 12920, 1972, 32, 4482, 15, 2228, 14053, 12, 65, 117, 12, 4477, 366, 10, 56, 39, 26, 11, 6174, 14198, 15, 1637, 13709, 398, 25, 26, 140, 12, 11, 819, 14100, 13986, 377, 14061, 10, 487, 55, 14, 17, 13]
Converted IDs to example ===> ['๋ณธ', '๊ณ ์•ˆ', '์€', '์ฃผ๋กœ', '์ผํšŒ์šฉ', 'ํ•ฉ์„ฑ', '##์„ธ', '##์ œ', '##์•ก', '์„', '์ง‘', '##์–ด', '##๋„ฃ', '์–ด', '๋ฐ€๋ด‰', 'ํ•˜', '๋Š”', '์„ธ์ œ', '##์•ก', '##ํฌ', '์˜', '๋‚ด๋ถ€', '๋ฅผ', '์›ํ˜ธ', '์ƒ', '์œผ๋กœ', '์—ด', '##์ค‘', '์ฐฉ', '##ํ•˜', '๋˜', '์„ธ์ œ', '##์•ก', '์ด', '๋ฐฐ์ถœ', '๋˜', '๋Š”', '์ ˆ๋‹จ๋ถ€', '์ชฝ', '์œผ๋กœ', '๋‚ด๋ฒฝ', '์„', 'ํ˜‘', '##์†Œ', 'ํ•˜', '๊ฒŒ', 'ํ˜•์„ฑ', 'ํ•˜', '์—ฌ์„œ', '๋‚ด๋ถ€', '์—', '๋“ค', '์–ด', '์žˆ', '๋Š”', '์„ธ์ œ', '##์•ก', '์„', '์ž˜', '์งœ', '์งˆ', '์ˆ˜', '์žˆ', '๋„๋ก', 'ํ•˜', '๋Š”', 'ํ•ฉ์„ฑ', '##์„ธ', '##์ œ', '์•ก', '##ํฌ', '์—', '๊ด€ํ•œ', '๊ฒƒ', '์ด', '๋‹ค', '.']
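
To feed these tokens to the model, they still have to be wrapped in BERT's standard input format. The sketch below continues from the usage example above and illustrates the usual construction: [CLS]/[SEP] markers, id conversion, and padding of the ids, attention mask, and segment ids up to a fixed length. The max_seq_length value is a hypothetical choice for illustration:

max_seq_length = 128  # hypothetical choice for illustration

# Standard single-sentence BERT input: [CLS] tokens ... [SEP], then padding.
bert_tokens = ["[CLS]"] + tokens[:max_seq_length - 2] + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(bert_tokens)
input_mask = [1] * len(input_ids)
segment_ids = [0] * len(input_ids)

pad_id = tokenizer.convert_tokens_to_ids(["[PAD]"])[0]
while len(input_ids) < max_seq_length:
    input_ids.append(pad_id)
    input_mask.append(0)
    segment_ids.append(0)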

Fine-tuning with KorSci-Bert

  • Google Bert์˜ Fine-tuning ๋ฐฉ๋ฒ• ์ฐธ๊ณ 
  • Sentence (and sentence-pair) classification tasks: "run_classifier.py" ์ฝ”๋“œ ํ™œ์šฉ
  • MRC(Machine Reading Comprehension) tasks: "run_squad.py" ์ฝ”๋“œ ํ™œ์šฉ
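
run_classifier.py selects the dataset through a task-specific processor. The sketch below follows the DataProcessor/InputExample conventions defined in Google BERT's run_classifier.py; it is meant to be added inside that script (which already imports os and defines both classes) and registered in the processors dict in main(). The class name, file names, label set, and TSV layout are all hypothetical:

# Hypothetical single-sentence classification processor for
# run_classifier.py; DataProcessor, InputExample, and os come from
# that script.
class KorSciProcessor(DataProcessor):

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        # Hypothetical label set; replace with the real class labels.
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        # Assumed TSV layout: label <TAB> sentence.
        examples = []
        for i, line in enumerate(lines):
            examples.append(InputExample(
                guid="%s-%d" % (set_type, i),
                text_a=line[1],
                text_b=None,
                label=line[0]))
        return examples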