---
language:
  - ko
tags:
  - roberta
license:
  - mit
---

## Training code

```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(unicode_normalizer="nfkc", trim_offsets=True)
ds = load_dataset("Bingsu/my-korean-training-corpus", use_auth_token=True)
# If you want to use a public dataset instead:
# ds = load_dataset("cc100", lang="ko")  # 50GB


# The corpus is about 35GB; using all of it would exhaust the machine, so only a portion was used.
ds_sample = ds["train"].train_test_split(0.35, seed=20220819)["test"]


def gen_text(batch_size: int = 5000):
    # Yield batches of raw text for the tokenizer trainer.
    for i in range(0, len(ds_sample), batch_size):
        yield ds_sample[i : i + batch_size]["text"]


tokenizer.train_from_iterator(
    gen_text(),
    vocab_size=50265,
    min_frequency=2,
    special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>",
    ],
)
tokenizer.save("my_tokenizer.json")
```

Training took about 7 hours (i5-12600, non-K).
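For reference, the saved `my_tokenizer.json` can be wrapped in a `RobertaTokenizerFast` before saving or uploading. This is a minimal sketch, not the exact steps used for this repo; the special-token arguments simply mirror the ones passed to `train_from_iterator` above, and the output directory name is only an example.

```python
from transformers import RobertaTokenizerFast

# Wrap the trained tokenizers file so it can be saved/pushed in transformers format.
hf_tokenizer = RobertaTokenizerFast(
    tokenizer_file="my_tokenizer.json",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)
hf_tokenizer.save_pretrained("BBPE_tokenizer_test")  # push_to_hub() is also possible
```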

## Usage

1.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Bingsu/BBPE_tokenizer_test")
# tokenizer is loaded as a RobertaTokenizerFast instance.
```
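A quick sanity check that the loaded tokenizer round-trips Korean text; the sample sentence is just an illustration:

```python
ids = tokenizer("์•ˆ๋…•ํ•˜์„ธ์š”")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.decode(ids, skip_special_tokens=True))
```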

2.

Download the `tokenizer.json` file first.

```python
from transformers import BartTokenizerFast, BertTokenizerFast

bart_tokenizer = BartTokenizerFast(tokenizer_file="tokenizer.json")
bert_tokenizer = BertTokenizerFast(tokenizer_file="tokenizer.json")
```

roberta์™€ ๊ฐ™์ด BBPE๋ฅผ ์‚ฌ์šฉํ•œ bart๋Š” ๋ฌผ๋ก ์ด๊ณ  bert์—๋„ ๋ถˆ๋Ÿฌ์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹ค๋งŒ ์ด๋ ‡๊ฒŒ ๋ถˆ๋Ÿฌ์™”์„ ๊ฒฝ์šฐ, model_max_len์ด ์ง€์ •์ด ๋˜์–ด์žˆ์ง€ ์•Š์œผ๋‹ˆ ์ง€์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.