Chessmen committed on
Commit cccb5a0 (1 parent: 0a7dfab)

Upload 9 files

README.md ADDED
@@ -0,0 +1,79 @@
+ ---
+ library_name: transformers
+ license: apache-2.0
+ base_model: distilbert-base-uncased
+ tags:
+ - generated_from_trainer
+ model-index:
+ - name: fine_tune_distilbert-base-uncased
+ results: []
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # fine_tune_distilbert-base-uncased
+
+ This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the IMDB dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 2.1226
+ - Model Preparation Time: 0.0016
+
+ ## Model description
+
+ DistilBERT base (uncased) fine-tuned with a masked language modeling objective on IMDB movie reviews; see `maskedlanguagemodel_pytorch.py` in this commit for the full training code.
+
+ ## Intended uses & limitations
+
+ Intended for fill-mask prediction on English, movie-review-style text; it inherits the limitations and biases of distilbert-base-uncased and is not a sentiment classifier.
+
+ ## Training and evaluation data
+
+ IMDB reviews were tokenized, concatenated, and grouped into 128-token chunks; 20% of the chunks were held out for evaluation (seed 42), and tokens were masked dynamically with a 15% probability by `DataCollatorForLanguageModeling`.
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 2e-05
+ - train_batch_size: 64
+ - eval_batch_size: 64
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - num_epochs: 20
+ - mixed_precision_training: Native AMP
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss | Model Preparation Time |
+ |:-------------:|:-----:|:-----:|:---------------:|:----------------------:|
+ | 2.5551 | 1.0 | 767 | 2.3648 | 0.0016 |
+ | 2.4329 | 2.0 | 1534 | 2.3181 | 0.0016 |
+ | 2.3874 | 3.0 | 2301 | 2.2831 | 0.0016 |
+ | 2.3409 | 4.0 | 3068 | 2.2422 | 0.0016 |
+ | 2.3124 | 5.0 | 3835 | 2.2302 | 0.0016 |
+ | 2.2895 | 6.0 | 4602 | 2.2104 | 0.0016 |
+ | 2.2649 | 7.0 | 5369 | 2.2014 | 0.0016 |
+ | 2.2445 | 8.0 | 6136 | 2.1939 | 0.0016 |
+ | 2.234 | 9.0 | 6903 | 2.1776 | 0.0016 |
+ | 2.2142 | 10.0 | 7670 | 2.1607 | 0.0016 |
+ | 2.208 | 11.0 | 8437 | 2.1682 | 0.0016 |
+ | 2.1933 | 12.0 | 9204 | 2.1530 | 0.0016 |
+ | 2.1808 | 13.0 | 9971 | 2.1493 | 0.0016 |
+ | 2.1689 | 14.0 | 10738 | 2.1422 | 0.0016 |
+ | 2.1598 | 15.0 | 11505 | 2.1347 | 0.0016 |
+ | 2.1567 | 16.0 | 12272 | 2.1373 | 0.0016 |
+ | 2.1458 | 17.0 | 13039 | 2.1270 | 0.0016 |
+ | 2.1475 | 18.0 | 13806 | 2.1200 | 0.0016 |
+ | 2.141 | 19.0 | 14573 | 2.1312 | 0.0016 |
+ | 2.1423 | 20.0 | 15340 | 2.1202 | 0.0016 |
+
+
+ ### Framework versions
+
+ - Transformers 4.44.2
+ - Pytorch 2.2.0+cu121
+ - Datasets 2.21.0
+ - Tokenizers 0.19.1
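
A minimal usage sketch for the checkpoint described above. It mirrors the `call_pipeline` step of the training script in this commit and assumes the Hub repo id `Chessmen/fine_tune_distilbert-base-uncased` that the script pushes to; a local directory saved by `trainer.save_model` would work the same way.

```python
from transformers import pipeline

# Repo id taken from the training script in this commit (an assumption if you
# trained your own copy); replace it with your local output directory if needed.
mask_filler = pipeline(
    "fill-mask",
    model="Chessmen/fine_tune_distilbert-base-uncased",
)

# Each prediction is a dict with "token_str", "score", and the filled "sequence".
for prediction in mask_filler("This is a great [MASK]."):
    print(f"{prediction['token_str']:>12}  {prediction['score']:.4f}")
```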
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "_name_or_path": "distilbert-base-uncased",
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForMaskedLM"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "initializer_range": 0.02,
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "pad_token_id": 0,
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.44.2",
+   "vocab_size": 30522
+ }
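
The config above declares a `DistilBertForMaskedLM` head on a 6-layer, 768-dimensional DistilBERT. A short sketch for loading and sanity-checking it; the repo id is again assumed from the training script, and any directory containing this `config.json` plus the weights works the same way.

```python
from transformers import AutoConfig, AutoModelForMaskedLM

repo_id = "Chessmen/fine_tune_distilbert-base-uncased"  # assumed repo id

config = AutoConfig.from_pretrained(repo_id)
print(config.model_type, config.n_layers, config.dim)  # distilbert 6 768

model = AutoModelForMaskedLM.from_pretrained(repo_id)
# The training script reports the size the same way, rounded to millions.
print(f"{model.num_parameters() / 1_000_000:.0f}M parameters")
```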
maskedlanguagemodel_pytorch.py ADDED
@@ -0,0 +1,255 @@
+ from datasets import load_dataset
+ from transformers import (
+     AutoTokenizer,
+     DataCollatorForLanguageModeling,
+     AutoModelForMaskedLM,
+     TrainingArguments,
+     Trainer,
+     pipeline,
+ )
+ import evaluate
+ import numpy as np
+ import torch
+ import math
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ print(device)
+
+ class MaskedLM:
+     def __init__(self):
+         self.model = None
+         self.metric = None
+         self.tokenizer = None
+         self.data_collator = None
+         self.raw_data = None
+         self.model_checkpoint = None
+         self.tokenized_dataset = None
+         self.chunk_size = 128
+         self.chunks_dataset = None
+         self.split_dataset = None
+         self.args = None
+
+     def load_dataset(self, name="imdb"):
+         self.raw_data = load_dataset(name)
+         print("Name of dataset: ", name)
+         print(self.raw_data)
+
+     def load_support(self, mlm_probability=0.15):
+         self.model_checkpoint = "distilbert-base-uncased"
+         self.tokenizer = AutoTokenizer.from_pretrained(self.model_checkpoint)
+         self.data_collator = DataCollatorForLanguageModeling(tokenizer=self.tokenizer, mlm_probability=mlm_probability)
+         print("Name of model checkpoint: " + self.model_checkpoint)
+         print("Tokenizer is fast: ", self.tokenizer.is_fast)
+         print("Mask token of tokenizer: ", self.tokenizer.mask_token)
+         print("Model max length of tokenizer: ", self.tokenizer.model_max_length)
+
+     def explore_infoModel(self, k=5):
+         model = AutoModelForMaskedLM.from_pretrained(self.model_checkpoint)
+         model_parameters = model.num_parameters() / 1_000_000
+         print(f">>> Number of parameters of {self.model_checkpoint}: {round(model_parameters)}M")
+
+         example = "This is a great [MASK]."
+         print("\n")
+         print(">>> Example: ", example)
+         inputs = self.tokenizer(example, return_tensors="pt")
+         token_logits = model(**inputs).logits
+         print(f"{'Number of tokens: ':<{30}}{len(inputs.tokens())}")
+         print(f"{'Tokens prepared for model: ':<{30}}{inputs.tokens()}")
+         print(f"{'IDs of tokens: ':<{30}}{inputs.input_ids}")
+         print(f"{'Mask token ID of model: ':<{30}}{self.tokenizer.mask_token_id}")
+         print(f"{'Logits of example: ':<{30}}{token_logits}")
+         print(f"{'Shape of logits: ':<{30}}{token_logits.size()}")
+
+         # Find the position of [MASK] so its logits can be extracted
+         mask_token_index = torch.where(inputs.input_ids == self.tokenizer.mask_token_id)[1]
+         print(f"{'Position of masked token: ':<{30}}{mask_token_index}")
+
+         # Logits over the whole vocabulary at the [MASK] position
+         mask_token_logits = token_logits[0, mask_token_index, :]
+         print(f"{'Logits of vocab for [MASK]: ':<{30}}{mask_token_logits}")
+
+         # Top-k candidates for [MASK]: highest logits and their vocabulary indices
+         top_k_values = torch.topk(mask_token_logits, k, dim=1).values[0].tolist()
+         print(f"{'Top values in vocab: ':<{30}}{top_k_values}")
+         top_k_tokens = torch.topk(mask_token_logits, k, dim=1).indices[0].tolist()
+         print(f"{'Top token IDs in vocab: ':<{30}}{top_k_tokens}")
+
+         # Show the top candidates in context
+         for token in top_k_tokens:
+             print(">>> ", example.replace(self.tokenizer.mask_token, self.tokenizer.decode([token])))
+
+
+     def get_feature_items(self, set="train", index=0, feature="text"):
+         return None if self.raw_data[set][index][feature] is None or self.raw_data[set][index][feature] == 0 else self.raw_data[set][index][feature]
+
+     def get_pair_items(self, set="train", index=0, feature1="text", feature2="label"):
+         feature1 = self.get_feature_items(set, index, feature1)
+         feature2 = self.get_feature_items(set, index, feature2)
+         if feature2 is not None:
+             line1 = ""
+             line2 = ""
+             for word, label in zip(feature1, feature2):
+                 line1 += str(word)
+                 line2 += str(label)
+             return line1, line2
+
+         return feature1, feature1
+
+     def get_tokenizer(self, set="train", index=0, feature="text"):
+         inputs = self.tokenizer(self.get_feature_items(set, index, feature))
+         return inputs.tokens(), inputs.word_ids()
+
+     def tokenizer_dataset(self, example):
+         inputs = self.tokenizer(example["text"])
+         inputs["word_ids"] = [inputs.word_ids(i) for i in range(len(inputs["input_ids"]))]
+         return inputs
+
+     def map_tokenize_dataset(self):
+         print("Start tokenizing dataset")
+         self.tokenized_dataset = self.raw_data.map(self.tokenizer_dataset, batched=True, remove_columns=["text", "label"])
+         print("Done mapping")
+         print("Tokenized dataset: ", self.tokenized_dataset)
+
+     def group_text_chunk(self, example):
+         # Concatenate all texts in the batch
+         concatenate_example = {k: sum(example[k], []) for k in example.keys()}
+
+         # Compute the total length
+         total_length = len(concatenate_example["input_ids"])
+         # Drop the remainder so every chunk has exactly chunk_size tokens
+         total_length = (total_length // self.chunk_size) * self.chunk_size
+
+         # Split into chunks of chunk_size
+         chunks = {
+             k: [t[i: i + self.chunk_size] for i in range(0, total_length, self.chunk_size)]
+             for k, t in concatenate_example.items()
+         }
+
+         # Create a labels column from input_ids (the collator masks the inputs later)
+         chunks["labels"] = chunks["input_ids"].copy()
+         return chunks
+
+     def map_chunk_dataset(self):
+         print("Start chunking dataset")
+         self.chunks_dataset = self.tokenized_dataset.map(self.group_text_chunk, batched=True)
+         print("Done mapping")
+         print("Chunked dataset: ", self.chunks_dataset)
+
+     def dataset_split(self, test_size=0.2):
+         self.split_dataset = self.chunks_dataset["train"].train_test_split(
+             test_size=test_size, seed=42
+         )
+         print("Prepared dataset: ", self.split_dataset)
+
+
+     def create_model(self):
+         print("Start creating model")
+         self.model = AutoModelForMaskedLM.from_pretrained(self.model_checkpoint)
+         print(self.model)
+
+     def create_argumentTrainer(self, output_dir="fine_tuned_", eval_strategy="epoch", logging_strategy="epoch",
+                                learning_rate=2e-5, num_train_epochs=20, weight_decay=0.01, batch_size=64,
+                                save_strategy="epoch", push_to_hub=False, hub_model_id="", fp16=True):
+         logging_steps = len(self.split_dataset["train"]) // batch_size
+         self.args = TrainingArguments(
+             # use_cpu=True,
+             output_dir=f"{output_dir}{self.model_checkpoint}",
+             overwrite_output_dir=True,
+             eval_strategy=eval_strategy,
+             save_strategy=save_strategy,
+             weight_decay=weight_decay,
+             learning_rate=learning_rate,
+             num_train_epochs=num_train_epochs,
+             per_device_train_batch_size=batch_size,
+             per_device_eval_batch_size=batch_size,
+             push_to_hub=push_to_hub,
+             hub_model_id=hub_model_id,
+             fp16=fp16,
+             logging_steps=logging_steps
+         )
+         print("Arguments ready for training")
+         return self.args
+
+     def call_train(self, model_path="pretrained_model_", set_train="train", set_val="test", push_to_hub=False, save_local=False):
+         trainer = Trainer(
+             model=self.model,
+             args=self.args,
+             train_dataset=self.split_dataset[set_train],
+             eval_dataset=self.split_dataset[set_val],
+             data_collator=self.data_collator,
+             tokenizer=self.tokenizer,
+         )
+         eval_result1 = trainer.evaluate()
+         print("Perplexity before training: ", math.exp(eval_result1["eval_loss"]))
+
+         print("Start training")
+         trainer.train()
+         print("Done training")
+
+         eval_result2 = trainer.evaluate()
+         print("Perplexity after training: ", math.exp(eval_result2["eval_loss"]))
+
+         if save_local:
+             trainer.save_model(model_path + self.model_checkpoint)
+             print("Done saving locally")
+
+         if push_to_hub:
+             trainer.push_to_hub(commit_message="Training complete")
+             print("Done pushing to hub")
+
+     def call_pipeline(self, local=False, path="", example=""):
+         if local:
+             model_checkpoint = "pretrained_model_" + self.model_checkpoint
+         else:
+             model_checkpoint = path
+         mask_filler = pipeline(
+             "fill-mask",
+             model=model_checkpoint,
+         )
+         print(mask_filler(example))
+
+ if __name__ == "__main__":
+     '''
+     1_LOADING THE DATASET
+     '''
+     mlm = MaskedLM()
+     mlm.load_dataset()
+     print("-"*50, "Loading tokenizer and data collator", "-"*50)
+     mlm.load_support()
+     print("-"*50, "Loading tokenizer and data collator", "-"*50)
+     '''
+     2_EXPLORING THE DATASET AND MODEL
+     '''
+     print("-"*50, "Exploring some information of the model", "-"*50)
+     mlm.explore_infoModel()
+     print("-"*50, "Exploring some information of the model", "-"*50)
+     print("Example[0] (text) in dataset: ", mlm.get_feature_items(set="train", index=0, feature="text")[:100] + "...")
+     print("Example[0] (label) in dataset: ", mlm.get_feature_items(set="train", index=0, feature="label"))
+     line1, line2 = mlm.get_pair_items(set="train", index=1, feature1="text", feature2="label")
+     print("--> Inp of Example[1]: ", line1[:20] + "...")
+     print("--> Out of Example[1]: ", line2[:20] + "...")
+     '''
+     3_PRE-PROCESSING THE DATASET
+     '''
+     tokens, word_ids = mlm.get_tokenizer(set="train", index=0, feature="text")
+     print("Tokens list of Example 0: ", tokens)
+     print("Word IDs list of Example 0: ", word_ids)
+     mlm.map_tokenize_dataset()
+     mlm.map_chunk_dataset()
+     mlm.dataset_split()
+     '''
+     4_INITIALIZING THE MODEL
+     '''
+     print("-"*50, f"Information of {mlm.model_checkpoint}", "-"*50)
+     mlm.create_model()
+     print("-"*50, f"Information of {mlm.model_checkpoint}", "-"*50)
+     '''
+     5_SELECTING HYPERPARAMETERS AND TRAINING
+     '''
+     mlm.create_argumentTrainer(push_to_hub=True, hub_model_id="Chessmen/" + "fine_tune_" + mlm.model_checkpoint)
+     mlm.call_train(save_local=True, push_to_hub=True)
+     '''
+     6_USING THE FINE-TUNED MODEL
+     '''
+     mlm.call_pipeline(path="Chessmen/fine_tune_distilbert-base-uncased", example="This is a great [MASK].")
+
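
The script reports masked-LM quality as perplexity, i.e. `exp(eval_loss)`, before and after training. As a quick cross-check against the model card above, the final validation loss of 2.1226 corresponds to a perplexity of roughly 8.4:

```python
import math

eval_loss = 2.1226            # final validation loss from the model card
print(math.exp(eval_loss))    # ~8.35
```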
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a6c4f97710c4144ef17d29c0346b8aa3b4e5feaef5f15349da25563bc08d7899
+ size 267954768
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
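
The tokenizer files above pin the standard BERT special tokens, with `[MASK]` at id 103 and a 512-token maximum length. A small sketch to verify this after loading; the repo id is assumed from the training script, and any directory containing these tokenizer files behaves the same way.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Chessmen/fine_tune_distilbert-base-uncased")

print(tokenizer.mask_token, tokenizer.mask_token_id)  # [MASK] 103
print(tokenizer.model_max_length)                     # 512
print(tokenizer("This is a great [MASK].").input_ids)
```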
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d300f8cac52844a5f0c8280c2941366a386056d55a3cc40165c5874f1b7cd5b2
+ size 5240
vocab.txt ADDED
The diff for this file is too large to render. See raw diff