Bingsu committed on
Commit
1925157
1 Parent(s): 0b70b66

Create README.md

Files changed (1)
  1. README.md +71 -0
README.md ADDED
@@ -0,0 +1,71 @@
+ ---
+ language:
+ - ko
+ tags:
+ - roberta
+ license:
+ - mit
+ ---
+
+ ## Training code
+
+ ```python
+ from datasets import load_dataset
+ from tokenizers import ByteLevelBPETokenizer
+
+ tokenizer = ByteLevelBPETokenizer(unicode_normalizer="nfkc", trim_offsets=True)
+ ds = load_dataset("Bingsu/my-korean-training-corpus", use_auth_token=True)
+ # To use publicly available data instead:
+ # ds = load_dataset("cc100", lang="ko") # 50GB
+
+
+ # This dataset is 35GB, and loading too much of it crashes the machine, so only part of it was used.
+ ds_sample = ds["train"].train_test_split(0.35, seed=20220819)["test"]
+
+
+ def gen_text(batch_size: int = 5000):
+     # Yield the "text" column in batches of batch_size rows.
+     for i in range(0, len(ds_sample), batch_size):
+         yield ds_sample[i : i + batch_size]["text"]
+
+
+ tokenizer.train_from_iterator(
+     gen_text(),
+     vocab_size=50265,
+     min_frequency=2,
+     special_tokens=[
+         "<s>",
+         "<pad>",
+         "</s>",
+         "<unk>",
+         "<mask>",
+     ],
+ )
+ tokenizer.save("my_tokenizer.json")
+ ```
+
+ Training took about 7 hours (i5-12600 non-K)
+ ![image](https://i.imgur.com/LNNbtGH.png)
+
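The batching generator in the training code can be tried out without the 35GB corpus. The sketch below is an illustration only: `gen_batches` and `toy_texts` are hypothetical names that stand in for `gen_text` and the `"text"` column of `ds_sample`. It shows the shape of what `train_from_iterator` receives, namely an iterator over lists of strings.

```python
from typing import Iterator, List, Sequence


def gen_batches(texts: Sequence[str], batch_size: int = 5000) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` items long."""
    for start in range(0, len(texts), batch_size):
        yield list(texts[start : start + batch_size])


# Hypothetical stand-in for the "text" column of ds_sample.
toy_texts = ["안녕하세요", "토크나이저", "학습", "예시", "문장"]

# Materialize the batches only to inspect them; training would consume the generator lazily.
batches = list(gen_batches(toy_texts, batch_size=2))
print([len(batch) for batch in batches])
```

With five texts and `batch_size=2`, the generator yields three batches of sizes 2, 2 and 1, so the last batch may be shorter than the rest. Only one batch is held in memory at a time, which is presumably why the README feeds the trainer in batches rather than passing the whole column at once.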
+ ## Usage
+
+ #### 1.
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("Bingsu/BBPE_tokenizer_test")
+
+ # tokenizer is an instance of the RobertaTokenizerFast class.
+ ```
+
+ #### 2.
+
+ First, download the `tokenizer.json` file.
+
+ ```python
+ from transformers import BartTokenizerFast, BertTokenizerFast
+
+ bart_tokenizer = BartTokenizerFast(tokenizer_file="tokenizer.json")
+ bert_tokenizer = BertTokenizerFast(tokenizer_file="tokenizer.json")
+ ```
+
+ The file loads not only into bart, which uses BBPE like roberta, but also into bert.
+ Note that a tokenizer loaded this way has no `model_max_length` set, so you need to set it yourself.
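One way to set the missing length limit is to pass `model_max_length` when constructing the tokenizer, since the `transformers` tokenizer classes accept it as a keyword argument. The helper below is a minimal sketch, not part of the README: `load_with_max_length` is a hypothetical name, and the default of 512 is an assumed value that should be replaced with the limit of the model the tokenizer is paired with.

```python
def load_with_max_length(tokenizer_cls, tokenizer_file="tokenizer.json", max_length=512):
    """Build a tokenizer from a tokenizer.json file with an explicit length limit.

    tokenizer_cls is expected to be a fast tokenizer class from transformers,
    such as BartTokenizerFast or BertTokenizerFast. The default max_length of
    512 is an assumption, not a value stated in the README.
    """
    return tokenizer_cls(tokenizer_file=tokenizer_file, model_max_length=max_length)
```

With `transformers` installed and `tokenizer.json` downloaded, `load_with_max_length(BartTokenizerFast)` should return a tokenizer whose `model_max_length` attribute is 512. Assigning `tokenizer.model_max_length = 512` after construction is an equivalent alternative.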