Commit 8553eb4 · Geetansh committed · Parent(s): 52916f3

added jupyter nbs for reference in future
Browse files
- ml_engine/ml-model-development-for-title-detection/data_for_model/Sentences_200.csv +199 -0
- ml_engine/ml-model-development-for-title-detection/requirements__.txt +7 -0
- ml_engine/ml-model-development-for-title-detection/saved_model/added_tokens.json +3 -0
- ml_engine/ml-model-development-for-title-detection/saved_model/config.json +41 -0
- ml_engine/ml-model-development-for-title-detection/saved_model/model.safetensors +3 -0
- ml_engine/ml-model-development-for-title-detection/saved_model/special_tokens_map.json +15 -0
- ml_engine/ml-model-development-for-title-detection/saved_model/spm.model +3 -0
- ml_engine/ml-model-development-for-title-detection/saved_model/tokenizer.json +0 -0
- ml_engine/ml-model-development-for-title-detection/saved_model/tokenizer_config.json +58 -0
- ml_engine/ml-model-development-for-title-detection/saved_model2/added_tokens.json +3 -0
- ml_engine/ml-model-development-for-title-detection/saved_model2/config.json +41 -0
- ml_engine/ml-model-development-for-title-detection/saved_model2/model.safetensors +3 -0
- ml_engine/ml-model-development-for-title-detection/saved_model2/special_tokens_map.json +15 -0
- ml_engine/ml-model-development-for-title-detection/saved_model2/spm.model +3 -0
- ml_engine/ml-model-development-for-title-detection/saved_model2/tokenizer.json +0 -0
- ml_engine/ml-model-development-for-title-detection/saved_model2/tokenizer_config.json +58 -0
- ml_engine/ml-model-development-for-title-detection/saved_model2/training_args.bin +3 -0
- ml_engine/ml-model-development-for-title-detection/try1_29Oct.ipynb +787 -0
- ml_engine/ml-model-development-for-title-detection/try1_29Oct_ExportingModel.ipynb +949 -0
ml_engine/ml-model-development-for-title-detection/data_for_model/Sentences_200.csv
ADDED
@@ -0,0 +1,199 @@
+S.No.,Sentence,Label
+1,Introduction to Quantum Mechanics,1
+2,"In this chapter, we explore the foundational principles",0
+3,The Rise and Fall of Civilizations,1
+4,Historical records reveal the complex trajectory of empire-building.,0
+5,Part III: Advanced Mathematical Concepts,1
+6,A theorem is a statement that has been proven,0
+7,Act 1: A Stormy Night,1
+8,Thunder echoes across the stage as the scene opens,0
+9,Section 5: Fluid Dynamics in Mechanical Engineering,1
+10,"In fluid dynamics, the flow behavior is analyzed through Reynolds number",0
+11,The Economic Implications of Inflation,0.8
+12,Inflation affects purchasing power and currency stability,0
+13,Module 4: Digital Signal Processing Techniques,1
+14,Signal processing helps in analyzing discrete-time signals,0
+15,Chapter 7: Theories of Social Contracts,1
+16,Philosophers like Rousseau and Hobbes developed ideas,0
+17,Film Theory and Cinematic Language,0.8
+18,Cinematic language refers to the techniques used by filmmakers.,0
+19,Unit 6: Cellular Biology and Genetics,1
+20,The study of cells is fundamental to understanding genetics.,0
+21,Painting Styles in the Renaissance Era,0.9
+22,Renaissance artists emphasized realism and human emotion,0
+23,Table of Contents,1
+24,1. Introduction ................................... 1,1
+25,Axioms in Set Theory,1
+26,"In set theory, axioms provide a foundational framework",0
+27,Research Methods in Political Science,0.7
+28,Political scientists use qualitative and quantitative methods,0
+29,Preface,0.9
+30,This book is a reflection on the decades of research,0
+31,Finally some relief!,0.8
+32,A storm brews over the silent town,0.8
+33,The Beginning of an End,0.8
+34,Sarah gets a birthday gift.,0.5
+35,An Unexpected Visitor,0.8
+36,Should wizard hit mommy?,0.6
+37,A decision is never truly final.,0.5
+38,Victory at Dawn,0.75
+39,The enemy was nowhere to be seen,0.3
+40,Beyond the Mountains,0.8
+41,The candle flickered and then died,0.3
+42,Breaking the Chains,0.7
+43,Life under a microscope,0.5
+44,Strange Encounters,0.8
+45,The more things change,0.2
+46,in pursuit of the truth,0.2
+47,Lost in translation,0.75
+48,A Journey Through Time,0.85
+49,Time stood still as she opened the door,0.1
+50,The Science of Everyday Life,0.7
+51,A tale of two cities,0.8
+52,The boy stared at the blank page,0.3
+53,Winter's cold embrace,0.55
+54,"Chapter closed, or so they thought",0.6
+55,Every ending is a new beginning,0.6
+56,Mysterious phone calls at midnight,0.7
+57,What lies beyond the stars?,0.65
+58,The art of survival,0.7
+59,And then the lights went out,0.4
+60,How far is too far?,0.5
+61,Finally peace arrived after months of turmoil,0.45
+62,A Glimpse into the Unknown,0.75
+65,Unfolding the Secrets of Ancient Civilizations,0.8
+66,Rain poured heavily drenching the quiet town,0.35
+67,Moments Before the War Began,0.7
+68,In the world of quantum physics reality takes a strange turn,0.3
+69,The Symphony of Broken Dreams,0.9
+70,And so they marched forward with hope in their hearts,0.2
+71,Whispers from the Past,0.8
+72,In the middle of chaos he found his true calling,0.3
+73,Reflections on a Life Well Lived,0.85
+74,Despite the odds they won,0.45
+75,The Lost Kingdoms of History,0.8
+76,Beneath the ruins ancient artifacts lay hidden,0.3
+77,On the Road to Redemption,0.8
+78,In the dense forest silence reigned supreme,0.35
+79,When Machines Begin to Think,0.8
+80,A calm breeze swept through the deserted streets,0.3
+81,Echoes of a Distant Era,0.8
+82,The clock struck midnight and everything changed,0.4
+83,Towards a Better Future,0.85
+84,It was a journey unlike any other,0.3
+85,The Paradox of Choice,0.7
+86,A Journey to the Center of the Earth,0.8
+87,Amidst the confusion she made a life-changing decision,0.3
+88,What Comes After Victory?,0.6
+89,The Final Frontier of Human Exploration,0.9
+90,The stars twinkled brightly as if telling their own story,0.35
+91,In Pursuit of Happiness,0.5
+92,Beyond the Veil of Reality,0.85
+93,A Tale of Two Cities,0.9
+94,Through the fire he emerged unscathed,0.4
+95,A New Era Dawns,0.75
+96,At the heart of the problem lies a simple truth,0.5
+97,The Enigma of Time,0.75
+98,Beneath the Ocean Waves,0.6
+99,She smiled knowing that the worst was behind them,0.45
+100,Fragments of Forgotten Memories,0.9
+101,A night to remember,0.8
+102,They whispered secrets only the moon could hear,0.3
+103,The forgotten art of storytelling,0.8
+104,In the lab they observed the anomaly under the microscope,0
+105,Moments like these don't last forever,0
+106,At last the code compiled without errors,0
+107,Breaking the chains of tradition,0.85
+108,Through the glass she saw a world she couldn't reach,0.2
+109,The edge of reality,0.9
+110,There was something haunting about the way he said goodbye,0.2
+111,What could have been?,0.65
+112,The evolution of scientific thought,0.8
+113,With one step forward he plunged into the unknown,0.4
+114,A new perspective on time and space,0.6
+115,She knew the answer but hesitated,0.2
+116,Unlocking the power of mindfulness,0.75
+117,And just like that the story ended,0.4
+118,The untold stories of revolution,0.8
+119,In silence they found their answers,0.3
+120,Whispers of forgotten places,0.9
+121,The child smiled at her reflection in the puddle,0.35
+122,What lies beneath the surface?,0.55
+123,The legacy of great minds,0.85
+124,With every sunrise hope is renewed,0.55
+125,Into the heart of darkness,0.8
+126,The machine whirred to life emitting strange sounds,0.1
+127,A journey through time and memory,0.9
+128,She opened the letter and gasped,0.35
+129,The philosophy of everyday life,0.75
+130,Standing at the crossroads he made his choice,0.1
+131,Beyond the boundaries of imagination,0.85
+132,They laughed unaware of what was coming next,0.3
+133,On the brink of discovery,0.8
+134,A soft breeze rustled the pages of an open book,0.2
+135,The power of an idea,0.65
+136,What if this was the end?,0.7
+137,In search of answers he traveled far and wide,0.3
+138,The sound of silence,0.85
+139,Rain began to fall washing away the traces of the day,0.2
+140,An experiment in thought,0.7
+141,They watched as history unfolded before their eyes,0.2
+142,The path less taken,0.8
+143,A mystery lurked in the shadows of the old library,0.45
+144,The birth of a new era,0.85
+145,Lost in the rhythm of the music she danced,0.3
+146,What does the future hold?,0.65
+147,A chronicle of unseen events,1
+148,She found herself in a place she didn't recognize,0.3
+149,Where the wild things are,1
+150,A single light flickered in the distance,0.35
+151,Conversations at the edge of reality,0.7
+152,He looked at the stars and wondered,0.3
+153,Reimagining the world through art,0.85
+154,A faint smile crossed her lips,0
+155,The geometry of the universe,0.9
+156,There was a strange beauty in the chaos,0.2
+157,Echoes from the future,0.85
+158,In the end they found what they were looking for,0
+159,The dawn of artificial consciousness,0.85
+160,The forest whispered secrets only the wind could understand,0.1
+161,What remains unsaid?,0.65
+162,The art of forgetting,0.8
+163,A sudden chill filled the room,0.2
+164,In search of the infinite,0.85
+165,The door creaked open revealing a dark corridor,0.2
+166,Fragments of an imagined life,0.75
+167,The numbers didn't add up something was missing,0.2
+168,The shape of dreams,0.85
+169,And so the night ended,0.5
+170,The quantum nature of reality,0.9
+171,The clock ticked away marking the end of an era,0.45
+172,Reflections in a broken mirror,0.7
+173,He held the key but did he know it?,0.2
+174,The illusion of control,0.75
+175,In the ashes they found hope,0.4
+176,The anatomy of a revolution,0.85
+177,It was a place where reality seemed to bend,0.2
+178,A journey through parallel worlds,0.9
+179,The experiment had unexpected consequences,0.3
+180,Whispers from the multiverse,0.85
+181,Lost in translation,0.75
+182,She waited hoping for an answer,0.2
+183,The mechanics of thought,0.8
+184,They stood at the edge of infinity,0.35
+185,Exploring the unknown,0.85
+186,Every ending is a new beginning,0.5
+187,The sound of forgotten names,0.9
+188,He closed the door leaving the past behind,0.2
+189,Visions of a distant future,0.8
+190,In the blink of an eye everything changed,0.4
+191,The end of an illusion,0.85
+192,The stars seemed closer than ever,0.2
+193,Beyond the horizon,0.75
+194,And so they ventured forth,0.5
+195,The limits of human knowledge,0.85
+196,A silent tear escaped her eye,0.2
+197,Reflections on infinity,0.9
+198,He knew the answer but it didn't matter anymore,0
+199,The mystery of the missing piece,1
+200,A new story was waiting to be told,0.6
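The CSV above is the training data for the title-detection model: each row is a sentence and a graded "title-ness" label in [0, 1] (note the gap at S.No. 63-64, so 198 data rows). A minimal sketch of loading it, mirroring the read_csv call used in the notebook further below; the relative path here is an assumption about where the file sits:

    import pandas as pd

    # 'S.No.' is the index column, exactly as the notebook loads it
    df = pd.read_csv("data_for_model/Sentences_200.csv", index_col="S.No.")
    print(df.shape)                 # (198, 2) because S.No. 63-64 are absent
    print(df["Label"].describe())   # labels are continuous scores in [0, 1]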
ml_engine/ml-model-development-for-title-detection/requirements__.txt
ADDED
@@ -0,0 +1,7 @@
+datasets==3.0.2
+transformers==4.46.0
+pandas==2.2.3
+numpy==2.0.2
+sentencepiece==0.2.0
+tiktoken==0.8.0
+torch==2.5.1+cu118
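All seven pins install from PyPI except torch==2.5.1+cu118, whose "+cu118" local-version tag marks a CUDA 11.8 build distributed from the PyTorch wheel index rather than PyPI. A quick sketch for checking an environment against these pins (importlib.metadata is in the standard library):

    from importlib.metadata import version

    # package names exactly as they appear in requirements__.txt
    for pkg in ["datasets", "transformers", "pandas", "numpy",
                "sentencepiece", "tiktoken", "torch"]:
        print(pkg, version(pkg))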
ml_engine/ml-model-development-for-title-detection/saved_model/added_tokens.json
ADDED
@@ -0,0 +1,3 @@
+{
+  "[MASK]": 128000
+}
ml_engine/ml-model-development-for-title-detection/saved_model/config.json
ADDED
@@ -0,0 +1,41 @@
+{
+  "_name_or_path": "microsoft/deberta-v3-small",
+  "architectures": [
+    "DebertaV2ForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "LABEL_0"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "LABEL_0": 0
+  },
+  "layer_norm_eps": 1e-07,
+  "max_position_embeddings": 512,
+  "max_relative_positions": -1,
+  "model_type": "deberta-v2",
+  "norm_rel_ebd": "layer_norm",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 6,
+  "pad_token_id": 0,
+  "pooler_dropout": 0,
+  "pooler_hidden_act": "gelu",
+  "pooler_hidden_size": 768,
+  "pos_att_type": [
+    "p2c",
+    "c2p"
+  ],
+  "position_biased_input": false,
+  "position_buckets": 256,
+  "relative_attention": true,
+  "share_att_key": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.46.0",
+  "type_vocab_size": 0,
+  "vocab_size": 128100
+}
ml_engine/ml-model-development-for-title-detection/saved_model/model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e941da07dd4a4f884aae6082850988e36acfdb9a10cffa21922bd68c7bf20606
+size 567595468
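These three lines are a Git LFS pointer, not the weights themselves: "oid" is the SHA-256 of the real blob and "size" its byte count (about 568 MB). A small sketch that parses the pointer, assuming the repo was cloned without LFS smudging so the pointer text is what sits on disk:

    # each pointer line is "key value"; split once to keep the value intact
    with open("saved_model/model.safetensors", encoding="utf-8") as f:
        meta = dict(line.split(" ", 1) for line in f.read().splitlines() if line)
    print(meta["oid"], meta["size"])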
ml_engine/ml-model-development-for-title-detection/saved_model/special_tokens_map.json
ADDED
@@ -0,0 +1,15 @@
+{
+  "bos_token": "[CLS]",
+  "cls_token": "[CLS]",
+  "eos_token": "[SEP]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
ml_engine/ml-model-development-for-title-detection/saved_model/spm.model
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c679fbf93643d19aab7ee10c0b99e460bdbc02fedf34b92b05af343b4af586fd
+size 2464616
ml_engine/ml-model-development-for-title-detection/saved_model/tokenizer.json
ADDED
The diff for this file is too large to render. See raw diff.
ml_engine/ml-model-development-for-title-detection/saved_model/tokenizer_config.json
ADDED
@@ -0,0 +1,58 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "128000": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "[CLS]",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": false,
+  "eos_token": "[SEP]",
+  "mask_token": "[MASK]",
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "sp_model_kwargs": {},
+  "split_by_punct": false,
+  "tokenizer_class": "DebertaV2Tokenizer",
+  "unk_token": "[UNK]",
+  "vocab_type": "spm"
+}
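Taken together, the saved_model files form a complete Hugging Face checkpoint: config.json declares DebertaV2ForSequenceClassification with a single output label (a regression head), and the tokenizer files declare a DebertaV2Tokenizer over the spm.model vocabulary. A minimal sketch of loading it and scoring one sentence, assuming the directory is used as-is and the LFS blobs have been pulled:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("saved_model")
    model = AutoModelForSequenceClassification.from_pretrained("saved_model")
    model.eval()

    inputs = tok("Introduction to Quantum Mechanics", return_tensors="pt")
    with torch.no_grad():
        # one logit per input: the model was trained to approximate a [0, 1] score
        score = model(**inputs).logits.squeeze().item()
    print(score)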
ml_engine/ml-model-development-for-title-detection/saved_model2/added_tokens.json
ADDED
@@ -0,0 +1,3 @@
+{
+  "[MASK]": 128000
+}
ml_engine/ml-model-development-for-title-detection/saved_model2/config.json
ADDED
@@ -0,0 +1,41 @@ (contents identical to saved_model/config.json above)
ml_engine/ml-model-development-for-title-detection/saved_model2/model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e941da07dd4a4f884aae6082850988e36acfdb9a10cffa21922bd68c7bf20606
+size 567595468
ml_engine/ml-model-development-for-title-detection/saved_model2/special_tokens_map.json
ADDED
@@ -0,0 +1,15 @@ (contents identical to saved_model/special_tokens_map.json above)
ml_engine/ml-model-development-for-title-detection/saved_model2/spm.model
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c679fbf93643d19aab7ee10c0b99e460bdbc02fedf34b92b05af343b4af586fd
+size 2464616
ml_engine/ml-model-development-for-title-detection/saved_model2/tokenizer.json
ADDED
The diff for this file is too large to render. See raw diff.
ml_engine/ml-model-development-for-title-detection/saved_model2/tokenizer_config.json
ADDED
@@ -0,0 +1,58 @@ (contents identical to saved_model/tokenizer_config.json above)
ml_engine/ml-model-development-for-title-detection/saved_model2/training_args.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4999cb1241c88dc93c2687149f19169521292eb1e5ca325d3a244469bb1602f9
+size 5176
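Unlike the JSON files, training_args.bin is a pickled TrainingArguments object saved by the Trainer. A sketch for inspecting it; weights_only=False is needed because this is an arbitrary pickled object rather than a tensor file, and transformers must be importable for unpickling to succeed:

    import torch

    args = torch.load("saved_model2/training_args.bin", weights_only=False)
    print(args.learning_rate, args.num_train_epochs, args.per_device_train_batch_size)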
ml_engine/ml-model-development-for-title-detection/try1_29Oct.ipynb
ADDED
@@ -0,0 +1,787 @@
(rendered from the committed .ipynb JSON; kernel: venv, Python 3.12.6, nbformat 4)

# Imports

In [65]:
    # imports
    import pandas as pd
    import numpy as np
    # import matplotlib as plt
    import random as rn
    import os
    os.environ['PYTHONHASHSEED'] = '0'
    os.environ['CUDA_VISIBLE_DEVICES'] = ''
    np.random.seed(37)
    rn.seed(1254)

# Load data, train, test, validation splits

In [67]:
    # EDA
    path_to_data = "./data/Sentences_200.csv"
    new_data_5_cat = pd.read_csv(path_to_data, index_col='S.No.')
    print(type(new_data_5_cat))
    display(new_data_5_cat.head())
    display(new_data_5_cat.describe())
    display(new_data_5_cat.shape)

Out:
    <class 'pandas.core.frame.DataFrame'>

                                                    Sentence  Label
    S.No.
    1                      Introduction to Quantum Mechanics    1.0
    2      In this chapter, we explore the foundational p...    0.0
    3                     The Rise and Fall of Civilizations    1.0
    4      Historical records reveal the complex trajecto...    0.0
    5               Part III: Advanced Mathematical Concepts    1.0

                Label
    count  198.000000
    mean     0.555051
    std      0.313770
    min      0.000000
    25%      0.300000
    50%      0.650000
    75%      0.800000
    max      1.000000

    (198, 2)

In [68]:
    # Make test, train, cv splits
    from datasets import Dataset
    ds = Dataset.from_pandas(new_data_5_cat)

    ds_train_temp_dict = ds.train_test_split(train_size=160)
    ds_train = ds_train_temp_dict['train']
    ds_test_cv_dict = ds_train_temp_dict['test'].train_test_split(test_size=20)
    ds_cv = ds_test_cv_dict['train']
    ds_test = ds_test_cv_dict['test']
    display(ds_train)
    display(ds_test)
    display(ds_cv)

Out:
    Dataset({
        features: ['Sentence', 'Label', 'S.No.'],
        num_rows: 160
    })
    Dataset({
        features: ['Sentence', 'Label', 'S.No.'],
        num_rows: 20
    })
    Dataset({
        features: ['Sentence', 'Label', 'S.No.'],
        num_rows: 18
    })

# Fine tune LLM

In [69]:
    # Get Tokenizer
    from transformers import AutoTokenizer
    model_nm = 'microsoft/deberta-v3-small'
    tokz = AutoTokenizer.from_pretrained(model_nm)
    tokz.tokenize('My name is Geetansh Bhardwaj.')

Out:
    UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
    ['▁My', '▁name', '▁is', '▁Geeta', 'n', 'sh', '▁Bhardwaj', '.']

In [70]:
    # Tokenize the 'Sentence' column
    def tokenize_string(row):
        return tokz(row['Sentence'])

    def tokenize_sentence_col(ds):
        '''
        We will tokenize the 'Sentence' column and add another column 'Sentence_id'. It will be used for fine-tuning
        ds: a dataset with 'Sentence' column
        '''
        tokenized_ds = ds.map(tokenize_string, batch_size=5)
        return tokenized_ds

    tokenized_ds_train = tokenize_sentence_col(ds_train)

Out:
    Map: 100%|██████████| 160/160 [00:00<00:00, 3348.83 examples/s]

In [71]:
    # An undocumented fact: Transformers assume that your label column is named "labels". Ours is named "Label", so we will change that
    tokenized_ds_train = tokenized_ds_train.rename_columns({'Label' : 'labels'})
    tokenized_ds_train

    tokenized_ds_cv = tokenize_sentence_col(ds_cv)
    tokenized_ds_cv = tokenized_ds_cv.rename_columns({'Label' : 'labels'})

Out:
    Map: 100%|██████████| 18/18 [00:00<00:00, 1504.20 examples/s]

In [72]:
    # Get the model (We are actually using a pre-trained one)
    from transformers import AutoModelForSequenceClassification
    my_model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)

Out:
    Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [73]:
    from transformers import TrainingArguments, Trainer
    bs = 5
    epochs = 4
    lr = 8e-5
    args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
                             evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
                             num_train_epochs=epochs, weight_decay=0.01, report_to='none')
    trainer = Trainer(my_model, args, train_dataset=tokenized_ds_train, eval_dataset=tokenized_ds_cv,
                      tokenizer=tokz)

Out:
    training_args.py:1559: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
    FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

In [74]:
    # Train (Here, fine tune) the model
    trainer.train()

Out:
    {'eval_loss': 0.13210749626159668, 'eval_runtime': 0.5116, 'eval_samples_per_second': 35.182, 'eval_steps_per_second': 3.909, 'epoch': 1.0}
    {'eval_loss': 0.025790058076381683, 'eval_runtime': 0.5595, 'eval_samples_per_second': 32.171, 'eval_steps_per_second': 3.575, 'epoch': 2.0}
    {'eval_loss': 0.03409378230571747, 'eval_runtime': 0.6622, 'eval_samples_per_second': 27.181, 'eval_steps_per_second': 3.02, 'epoch': 3.0}
    100%|██████████| 128/128 [03:58<00:00, 1.86s/it]
    {'eval_loss': 0.024491995573043823, 'eval_runtime': 0.543, 'eval_samples_per_second': 33.147, 'eval_steps_per_second': 3.683, 'epoch': 4.0}
    {'train_runtime': 238.5125, 'train_samples_per_second': 2.683, 'train_steps_per_second': 0.537, 'train_loss': 0.09053848683834076, 'epoch': 4.0}
    TrainOutput(global_step=128, training_loss=0.09053848683834076, metrics={'train_runtime': 238.5125, 'train_samples_per_second': 2.683, 'train_steps_per_second': 0.537, 'total_flos': 1818871829700.0, 'train_loss': 0.09053848683834076, 'epoch': 4.0})

In [75]:
    # Report loss for your model using the test set
    tokenized_ds_test = tokenize_sentence_col(ds_test)
    tokenized_ds_test = tokenized_ds_test.rename_columns({'Label' : 'labels'})

    preds = trainer.predict(tokenized_ds_test).predictions.astype(float)
    preds

Out:
    Map: 100%|██████████| 20/20 [00:00<00:00, 50.43 examples/s]
    100%|██████████| 2/2 [00:00<00:00, 13.74it/s]
    array([0.85534549, 0.31081381, 0.90419859, 0.87101161, 0.78344548,
           0.30044168, 0.93448901, 0.90961564, 0.58258021, 0.93629748,
           0.91476035, 0.34552005, 0.77351129, 0.48210973, 0.433981  ,
           0.27944249, 0.89211512, 0.2244986 , 0.25287008, 0.07797185])

In [77]:
    # Using MAE to calculate loss
    def get_mae(preds, real):
        '''
        preds, real: array
        '''
        mae = np.mean(np.abs(preds - real))
        return mae

    real = np.array(tokenized_ds_test['labels'])

    print(f"MAE: {get_mae(preds, real)}")

    # Print predictions on test side-by-side
    m = pd.DataFrame({'a':real.reshape(20,), 'b':preds.reshape(20)})
    m

Out:
    MAE: 0.10661058641970159
           a         b
    0   0.85  0.855345
    1   0.40  0.310814
    2   0.80  0.904199
    3   0.85  0.871012
    4   0.70  0.783445
    5   0.30  0.300442
    6   0.75  0.934489
    7   0.85  0.909616
    8   0.70  0.582580
    9   0.90  0.936297
    10  0.70  0.914760
    11  0.20  0.345520
    12  0.90  0.773511
    13  0.20  0.482110
    14  0.40  0.433981
    15  0.20  0.279442
    16  0.75  0.892115
    17  0.30  0.224499
    18  0.00  0.252870
    19  0.00  0.077972

In [ ]:
    # MAE of my model: 0.1 (Based on test set)

# Check if your GPU is available

In [79]:
    import torch
    torch.cuda.is_available()

Out:
    False
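The notebook computes MAE once, after training, by comparing trainer.predict() outputs against the labels. The same metric can also be reported at every evaluation epoch by handing the Trainer a compute_metrics callback; a minimal sketch, following the standard (predictions, labels) shape the Trainer passes in:

    import numpy as np

    def compute_metrics(eval_pred):
        preds, labels = eval_pred
        preds = preds.reshape(-1)          # (N, 1) regression logits -> (N,)
        return {"mae": float(np.mean(np.abs(preds - labels)))}

    # trainer = Trainer(my_model, args, train_dataset=tokenized_ds_train,
    #                   eval_dataset=tokenized_ds_cv, compute_metrics=compute_metrics)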
ml_engine/ml-model-development-for-title-detection/try1_29Oct_ExportingModel.ipynb
ADDED
@@ -0,0 +1,949 @@
(rendered from the committed .ipynb JSON; the page truncates partway through this file)

# Imports

In [1]:
    # imports
    import pandas as pd
    import numpy as np
    # import matplotlib as plt
    import random as rn
    import os
    os.environ['PYTHONHASHSEED'] = '0'
    os.environ['CUDA_VISIBLE_DEVICES'] = ''
    np.random.seed(37)
    rn.seed(1254)

# Load data, train, test, validation splits

In [2]:
Out:
    <class 'pandas.core.frame.DataFrame'>
    [head() and describe() tables and the (198, 2) shape, identical to the In [67] output of try1_29Oct.ipynb above; the rendering cuts off here, mid-file]
|
203 |
+
"output_type": "display_data"
|
204 |
+
}
|
205 |
+
],
|
206 |
+
"source": [
|
207 |
+
"# EDA\n",
|
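+
"# Labels are continuous scores in [0, 1] (degree of \"title-ness\"), not binary classes;\n",
|
+
"# this is why the task is framed as regression (num_labels=1) further below.\n",
|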
208 |
+
"path_to_data = \"./data/Sentences_200.csv\"\n",
|
209 |
+
"new_data_5_cat = pd.read_csv(path_to_data, index_col='S.No.')\n",
|
210 |
+
"print(type(new_data_5_cat))\n",
|
211 |
+
"display(new_data_5_cat.head())\n",
|
212 |
+
"display(new_data_5_cat.describe())\n",
|
213 |
+
"display(new_data_5_cat.shape)"
|
214 |
+
]
|
215 |
+
},
|
216 |
+
{
|
217 |
+
"cell_type": "code",
|
218 |
+
"execution_count": 3,
|
219 |
+
"metadata": {},
|
220 |
+
"outputs": [
|
221 |
+
{
|
222 |
+
"name": "stderr",
|
223 |
+
"output_type": "stream",
|
224 |
+
"text": [
|
225 |
+
"c:\\Users\\Geetansh\\Desktop\\New_folder\\venv\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
|
226 |
+
" from .autonotebook import tqdm as notebook_tqdm\n"
|
227 |
+
]
|
228 |
+
},
|
229 |
+
{
|
230 |
+
"data": {
|
231 |
+
"text/plain": [
|
232 |
+
"Dataset({\n",
|
233 |
+
" features: ['Sentence', 'Label', 'S.No.'],\n",
|
234 |
+
" num_rows: 160\n",
|
235 |
+
"})"
|
236 |
+
]
|
237 |
+
},
|
238 |
+
"metadata": {},
|
239 |
+
"output_type": "display_data"
|
240 |
+
},
|
241 |
+
{
|
242 |
+
"data": {
|
243 |
+
"text/plain": [
|
244 |
+
"Dataset({\n",
|
245 |
+
" features: ['Sentence', 'Label', 'S.No.'],\n",
|
246 |
+
" num_rows: 20\n",
|
247 |
+
"})"
|
248 |
+
]
|
249 |
+
},
|
250 |
+
"metadata": {},
|
251 |
+
"output_type": "display_data"
|
252 |
+
},
|
253 |
+
{
|
254 |
+
"data": {
|
255 |
+
"text/plain": [
|
256 |
+
"Dataset({\n",
|
257 |
+
" features: ['Sentence', 'Label', 'S.No.'],\n",
|
258 |
+
" num_rows: 18\n",
|
259 |
+
"})"
|
260 |
+
]
|
261 |
+
},
|
262 |
+
"metadata": {},
|
263 |
+
"output_type": "display_data"
|
264 |
+
}
|
265 |
+
],
|
266 |
+
"source": [
|
267 |
+
"# Make test, train, cv splits\n",
|
268 |
+
"from datasets import Dataset\n",
|
269 |
+
"ds = Dataset.from_pandas(new_data_5_cat)\n",
|
270 |
+
"\n",
|
271 |
+
"ds_train_temp_dict = ds.train_test_split(train_size=160)\n",
|
272 |
+
"ds_train = ds_train_temp_dict['train']\n",
|
273 |
+
"ds_test_cv_dict = ds_train_temp_dict['test'].train_test_split(test_size=20)\n",
|
274 |
+
"ds_cv = ds_test_cv_dict['train']\n",
|
275 |
+
"ds_test = ds_test_cv_dict['test']\n",
|
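+
"# Note: train_test_split shuffles randomly; passing a seed (e.g. seed=42) would make\n",
|
+
"# these splits reproducible across runs.\n",
|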
276 |
+
"display(ds_train)\n",
|
277 |
+
"display(ds_test)\n",
|
278 |
+
"display(ds_cv)"
|
279 |
+
]
|
280 |
+
},
|
281 |
+
{
|
282 |
+
"cell_type": "markdown",
|
283 |
+
"metadata": {},
|
284 |
+
"source": [
|
285 |
+
"# Fine tune LLM"
|
286 |
+
]
|
287 |
+
},
|
288 |
+
{
|
289 |
+
"cell_type": "code",
|
290 |
+
"execution_count": null,
|
291 |
+
"metadata": {},
|
292 |
+
"outputs": [
|
293 |
+
{
|
294 |
+
"name": "stderr",
|
295 |
+
"output_type": "stream",
|
296 |
+
"text": [
|
297 |
+
"c:\\Users\\Geetansh\\Desktop\\New_folder\\venv\\Lib\\site-packages\\transformers\\convert_slow_tokenizer.py:561: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.\n",
|
298 |
+
" warnings.warn(\n"
|
299 |
+
]
|
300 |
+
},
|
301 |
+
{
|
302 |
+
"data": {
|
303 |
+
"text/plain": [
|
304 |
+
"['▁My', '▁name', '▁is', '▁Geeta', 'n', 'sh', '▁Bhardwaj', '.']"
|
305 |
+
]
|
306 |
+
},
|
307 |
+
"execution_count": 4,
|
308 |
+
"metadata": {},
|
309 |
+
"output_type": "execute_result"
|
310 |
+
}
|
311 |
+
],
|
312 |
+
"source": [
|
313 |
+
"# Get Tokenizer\n",
|
314 |
+
"from transformers import AutoTokenizer\n",
|
315 |
+
"model_nm = 'microsoft/deberta-v3-small'\n",
|
316 |
+
"tokz = AutoTokenizer.from_pretrained(model_nm)\n",
|
317 |
+
"tokz.tokenize('My name is Geetansh Bhardwaj.')"
|
318 |
+
]
|
319 |
+
},
|
320 |
+
{
|
321 |
+
"cell_type": "code",
|
322 |
+
"execution_count": 5,
|
323 |
+
"metadata": {},
|
324 |
+
"outputs": [
|
325 |
+
{
|
326 |
+
"name": "stderr",
|
327 |
+
"output_type": "stream",
|
328 |
+
"text": [
|
329 |
+
"Map: 100%|██████████| 160/160 [00:00<00:00, 4079.69 examples/s]\n"
|
330 |
+
]
|
331 |
+
}
|
332 |
+
],
|
333 |
+
"source": [
|
334 |
+
"# Tokenize the 'Sentence' column\n",
|
335 |
+
"def tokenize_string(row):\n",
|
336 |
+
" return tokz(row['Sentence'])\n",
|
337 |
+
"\n",
|
338 |
+
"def tokenize_sentence_col(ds):\n",
|
339 |
+
" '''\n",
|
340 |
+
" We will tokenize the 'Sentence' column and add another column 'Sentence_id'. It will be used for fine-tuning\n",
|
341 |
+
" ds: a dataset with 'Sentence' column\n",
|
342 |
+
" '''\n",
|
343 |
+
"\n",
|
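+
" # Note: batch_size only takes effect together with batched=True; as written, map()\n",
|
+
" # tokenizes one row at a time. A batched variant would be, e.g.:\n",
|
+
" # ds.map(tokenize_string, batched=True, batch_size=5)\n",
|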
344 |
+
" tokenized_ds = ds.map(tokenize_string, batch_size=5)\n",
|
345 |
+
" return tokenized_ds\n",
|
346 |
+
"\n",
|
347 |
+
"tokenized_ds_train = tokenize_sentence_col(ds_train)"
|
348 |
+
]
|
349 |
+
},
|
350 |
+
{
|
351 |
+
"cell_type": "code",
|
352 |
+
"execution_count": 6,
|
353 |
+
"metadata": {},
|
354 |
+
"outputs": [
|
355 |
+
{
|
356 |
+
"name": "stderr",
|
357 |
+
"output_type": "stream",
|
358 |
+
"text": [
|
359 |
+
"Map: 100%|██████████| 18/18 [00:00<00:00, 2243.01 examples/s]\n"
|
360 |
+
]
|
361 |
+
}
|
362 |
+
],
|
363 |
+
"source": [
|
364 |
+
"# An undocumented fact: Transformers assume that your label column is named \"labels\". Ours is named \"Label\", so we will change that\n",
|
365 |
+
"tokenized_ds_train = tokenized_ds_train.rename_columns({'Label' : 'labels'})\n",
|
366 |
+
"tokenized_ds_train\n",
|
367 |
+
"\n",
|
368 |
+
"tokenized_ds_cv = tokenize_sentence_col(ds_cv)\n",
|
369 |
+
"tokenized_ds_cv = tokenized_ds_cv.rename_columns({'Label' : 'labels'})"
|
370 |
+
]
|
371 |
+
},
|
372 |
+
{
|
373 |
+
"cell_type": "code",
|
374 |
+
"execution_count": 7,
|
375 |
+
"metadata": {},
|
376 |
+
"outputs": [
|
377 |
+
{
|
378 |
+
"name": "stderr",
|
379 |
+
"output_type": "stream",
|
380 |
+
"text": [
|
381 |
+
"Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']\n",
|
382 |
+
"You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
|
383 |
+
]
|
384 |
+
}
|
385 |
+
],
|
386 |
+
"source": [
|
387 |
+
"# Get the model (We are actually using a pre-trained one)\n",
|
388 |
+
"from transformers import AutoModelForSequenceClassification\n",
|
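+
"# With num_labels=1 the classification head has a single output and transformers\n",
|
+
"# treats the task as regression, so the Trainer optimizes MSE against the float labels.\n",
|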
389 |
+
"my_model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)"
|
390 |
+
]
|
391 |
+
},
|
392 |
+
{
|
393 |
+
"cell_type": "code",
|
394 |
+
"execution_count": 8,
|
395 |
+
"metadata": {},
|
396 |
+
"outputs": [
|
397 |
+
{
|
398 |
+
"name": "stdout",
|
399 |
+
"output_type": "stream",
|
400 |
+
"text": [
|
401 |
+
"WARNING:tensorflow:From c:\\Users\\Geetansh\\Desktop\\New_folder\\venv\\Lib\\site-packages\\tf_keras\\src\\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.\n",
|
402 |
+
"\n"
|
403 |
+
]
|
404 |
+
},
|
405 |
+
{
|
406 |
+
"name": "stderr",
|
407 |
+
"output_type": "stream",
|
408 |
+
"text": [
|
409 |
+
"c:\\Users\\Geetansh\\Desktop\\New_folder\\venv\\Lib\\site-packages\\transformers\\training_args.py:1559: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead\n",
|
410 |
+
" warnings.warn(\n",
|
411 |
+
"C:\\Users\\Geetansh\\AppData\\Local\\Temp\\ipykernel_4252\\1403743469.py:8: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.\n",
|
412 |
+
" trainer = Trainer(my_model, args, train_dataset=tokenized_ds_train, eval_dataset=tokenized_ds_cv,\n"
|
413 |
+
]
|
414 |
+
}
|
415 |
+
],
|
416 |
+
"source": [
|
417 |
+
"from transformers import TrainingArguments, Trainer\n",
|
418 |
+
"bs = 5\n",
|
419 |
+
"epochs = 4\n",
|
420 |
+
"lr = 8e-5\n",
|
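+
"# Note: evaluation_strategy is deprecated in newer transformers releases (see the\n",
|
+
"# FutureWarning below); eval_strategy is the replacement keyword.\n",
|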
421 |
+
"args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,\n",
|
422 |
+
" evaluation_strategy=\"epoch\", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,\n",
|
423 |
+
" num_train_epochs=epochs, weight_decay=0.01, report_to='none')\n",
|
424 |
+
"trainer = Trainer(my_model, args, train_dataset=tokenized_ds_train, eval_dataset=tokenized_ds_cv,\n",
|
425 |
+
" tokenizer=tokz)"
|
426 |
+
]
|
427 |
+
},
|
428 |
+
{
|
429 |
+
"cell_type": "code",
|
430 |
+
"execution_count": 9,
|
431 |
+
"metadata": {},
|
432 |
+
"outputs": [
|
433 |
+
{
|
434 |
+
"name": "stderr",
|
435 |
+
"output_type": "stream",
|
436 |
+
"text": [
|
437 |
+
" \n",
|
438 |
+
" 25%|██▌ | 32/128 [00:10<00:26, 3.56it/s]"
|
439 |
+
]
|
440 |
+
},
|
441 |
+
{
|
442 |
+
"name": "stdout",
|
443 |
+
"output_type": "stream",
|
444 |
+
"text": [
|
445 |
+
"{'eval_loss': 0.09050914645195007, 'eval_runtime': 0.3554, 'eval_samples_per_second': 50.653, 'eval_steps_per_second': 5.628, 'epoch': 1.0}\n"
|
446 |
+
]
|
447 |
+
},
|
448 |
+
{
|
449 |
+
"name": "stderr",
|
450 |
+
"output_type": "stream",
|
451 |
+
"text": [
|
452 |
+
" \n",
|
453 |
+
" 50%|█████ | 64/128 [00:19<00:17, 3.68it/s]"
|
454 |
+
]
|
455 |
+
},
|
456 |
+
{
|
457 |
+
"name": "stdout",
|
458 |
+
"output_type": "stream",
|
459 |
+
"text": [
|
460 |
+
"{'eval_loss': 0.04030601680278778, 'eval_runtime': 0.3239, 'eval_samples_per_second': 55.567, 'eval_steps_per_second': 6.174, 'epoch': 2.0}\n"
|
461 |
+
]
|
462 |
+
},
|
463 |
+
{
|
464 |
+
"name": "stderr",
|
465 |
+
"output_type": "stream",
|
466 |
+
"text": [
|
467 |
+
" \n",
|
468 |
+
" 76%|███████▌ | 97/128 [00:28<00:10, 2.98it/s]"
|
469 |
+
]
|
470 |
+
},
|
471 |
+
{
|
472 |
+
"name": "stdout",
|
473 |
+
"output_type": "stream",
|
474 |
+
"text": [
|
475 |
+
"{'eval_loss': 0.022483834996819496, 'eval_runtime': 0.3246, 'eval_samples_per_second': 55.448, 'eval_steps_per_second': 6.161, 'epoch': 3.0}\n"
|
476 |
+
]
|
477 |
+
},
|
478 |
+
{
|
479 |
+
"name": "stderr",
|
480 |
+
"output_type": "stream",
|
481 |
+
"text": [
|
482 |
+
" \n",
|
483 |
+
"100%|██████████| 128/128 [00:41<00:00, 3.07it/s]"
|
484 |
+
]
|
485 |
+
},
|
486 |
+
{
|
487 |
+
"name": "stdout",
|
488 |
+
"output_type": "stream",
|
489 |
+
"text": [
|
490 |
+
"{'eval_loss': 0.0200485959649086, 'eval_runtime': 0.3606, 'eval_samples_per_second': 49.921, 'eval_steps_per_second': 5.547, 'epoch': 4.0}\n",
|
491 |
+
"{'train_runtime': 41.7528, 'train_samples_per_second': 15.328, 'train_steps_per_second': 3.066, 'train_loss': 0.11997667700052261, 'epoch': 4.0}\n"
|
492 |
+
]
|
493 |
+
},
|
494 |
+
{
|
495 |
+
"name": "stderr",
|
496 |
+
"output_type": "stream",
|
497 |
+
"text": [
|
498 |
+
"\n"
|
499 |
+
]
|
500 |
+
},
|
501 |
+
{
|
502 |
+
"data": {
|
503 |
+
"text/plain": [
|
504 |
+
"TrainOutput(global_step=128, training_loss=0.11997667700052261, metrics={'train_runtime': 41.7528, 'train_samples_per_second': 15.328, 'train_steps_per_second': 3.066, 'total_flos': 1818871829700.0, 'train_loss': 0.11997667700052261, 'epoch': 4.0})"
|
505 |
+
]
|
506 |
+
},
|
507 |
+
"execution_count": 9,
|
508 |
+
"metadata": {},
|
509 |
+
"output_type": "execute_result"
|
510 |
+
}
|
511 |
+
],
|
512 |
+
"source": [
|
513 |
+
"# Train (Here, fine tune) the model\n",
|
514 |
+
"trainer.train()"
|
515 |
+
]
|
516 |
+
},
|
517 |
+
{
|
518 |
+
"cell_type": "code",
|
519 |
+
"execution_count": 10,
|
520 |
+
"metadata": {},
|
521 |
+
"outputs": [
|
522 |
+
{
|
523 |
+
"name": "stderr",
|
524 |
+
"output_type": "stream",
|
525 |
+
"text": [
|
526 |
+
"Map: 100%|██████████| 20/20 [00:00<00:00, 162.84 examples/s]\n",
|
527 |
+
"100%|██████████| 2/2 [00:00<00:00, 13.26it/s]\n"
|
528 |
+
]
|
529 |
+
},
|
530 |
+
{
|
531 |
+
"data": {
|
532 |
+
"text/plain": [
|
533 |
+
"array([0.86230469, 0.28979492, 0.91162109, 0.86816406, 0.87988281,\n",
|
534 |
+
" 0.21826172, 0.91064453, 0.89013672, 0.41748047, 0.8984375 ,\n",
|
535 |
+
" 0.89355469, 0.14257812, 0.89160156, 0.35131836, 0.34375 ,\n",
|
536 |
+
" 0.23815918, 0.87841797, 0.20471191, 0.10784912, 0.02485657])"
|
537 |
+
]
|
538 |
+
},
|
539 |
+
"execution_count": 10,
|
540 |
+
"metadata": {},
|
541 |
+
"output_type": "execute_result"
|
542 |
+
}
|
543 |
+
],
|
544 |
+
"source": [
|
545 |
+
"# Report loss for your model using the test set\n",
|
546 |
+
"tokenized_ds_test = tokenize_sentence_col(ds_test)\n",
|
547 |
+
"tokenized_ds_test = tokenized_ds_test.rename_columns({'Label' : 'labels'})\n",
|
548 |
+
"\n",
|
549 |
+
"preds = trainer.predict(tokenized_ds_test).predictions.astype(float)\n",
|
550 |
+
"preds"
|
551 |
+
]
|
552 |
+
},
|
553 |
+
{
|
554 |
+
"cell_type": "code",
|
555 |
+
"execution_count": 11,
|
556 |
+
"metadata": {},
|
557 |
+
"outputs": [
|
558 |
+
{
|
559 |
+
"name": "stdout",
|
560 |
+
"output_type": "stream",
|
561 |
+
"text": [
|
562 |
+
"MAE: 0.09301467895507813\n"
|
563 |
+
]
|
564 |
+
},
|
565 |
+
{
|
566 |
+
"data": {
|
567 |
+
"text/html": [
|
568 |
+
"<div>\n",
|
569 |
+
"<style scoped>\n",
|
570 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
571 |
+
" vertical-align: middle;\n",
|
572 |
+
" }\n",
|
573 |
+
"\n",
|
574 |
+
" .dataframe tbody tr th {\n",
|
575 |
+
" vertical-align: top;\n",
|
576 |
+
" }\n",
|
577 |
+
"\n",
|
578 |
+
" .dataframe thead th {\n",
|
579 |
+
" text-align: right;\n",
|
580 |
+
" }\n",
|
581 |
+
"</style>\n",
|
582 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
583 |
+
" <thead>\n",
|
584 |
+
" <tr style=\"text-align: right;\">\n",
|
585 |
+
" <th></th>\n",
|
586 |
+
" <th>a</th>\n",
|
587 |
+
" <th>b</th>\n",
|
588 |
+
" </tr>\n",
|
589 |
+
" </thead>\n",
|
590 |
+
" <tbody>\n",
|
591 |
+
" <tr>\n",
|
592 |
+
" <th>0</th>\n",
|
593 |
+
" <td>0.85</td>\n",
|
594 |
+
" <td>0.862305</td>\n",
|
595 |
+
" </tr>\n",
|
596 |
+
" <tr>\n",
|
597 |
+
" <th>1</th>\n",
|
598 |
+
" <td>0.40</td>\n",
|
599 |
+
" <td>0.289795</td>\n",
|
600 |
+
" </tr>\n",
|
601 |
+
" <tr>\n",
|
602 |
+
" <th>2</th>\n",
|
603 |
+
" <td>0.80</td>\n",
|
604 |
+
" <td>0.911621</td>\n",
|
605 |
+
" </tr>\n",
|
606 |
+
" <tr>\n",
|
607 |
+
" <th>3</th>\n",
|
608 |
+
" <td>0.85</td>\n",
|
609 |
+
" <td>0.868164</td>\n",
|
610 |
+
" </tr>\n",
|
611 |
+
" <tr>\n",
|
612 |
+
" <th>4</th>\n",
|
613 |
+
" <td>0.70</td>\n",
|
614 |
+
" <td>0.879883</td>\n",
|
615 |
+
" </tr>\n",
|
616 |
+
" <tr>\n",
|
617 |
+
" <th>5</th>\n",
|
618 |
+
" <td>0.30</td>\n",
|
619 |
+
" <td>0.218262</td>\n",
|
620 |
+
" </tr>\n",
|
621 |
+
" <tr>\n",
|
622 |
+
" <th>6</th>\n",
|
623 |
+
" <td>0.75</td>\n",
|
624 |
+
" <td>0.910645</td>\n",
|
625 |
+
" </tr>\n",
|
626 |
+
" <tr>\n",
|
627 |
+
" <th>7</th>\n",
|
628 |
+
" <td>0.85</td>\n",
|
629 |
+
" <td>0.890137</td>\n",
|
630 |
+
" </tr>\n",
|
631 |
+
" <tr>\n",
|
632 |
+
" <th>8</th>\n",
|
633 |
+
" <td>0.70</td>\n",
|
634 |
+
" <td>0.417480</td>\n",
|
635 |
+
" </tr>\n",
|
636 |
+
" <tr>\n",
|
637 |
+
" <th>9</th>\n",
|
638 |
+
" <td>0.90</td>\n",
|
639 |
+
" <td>0.898438</td>\n",
|
640 |
+
" </tr>\n",
|
641 |
+
" <tr>\n",
|
642 |
+
" <th>10</th>\n",
|
643 |
+
" <td>0.70</td>\n",
|
644 |
+
" <td>0.893555</td>\n",
|
645 |
+
" </tr>\n",
|
646 |
+
" <tr>\n",
|
647 |
+
" <th>11</th>\n",
|
648 |
+
" <td>0.20</td>\n",
|
649 |
+
" <td>0.142578</td>\n",
|
650 |
+
" </tr>\n",
|
651 |
+
" <tr>\n",
|
652 |
+
" <th>12</th>\n",
|
653 |
+
" <td>0.90</td>\n",
|
654 |
+
" <td>0.891602</td>\n",
|
655 |
+
" </tr>\n",
|
656 |
+
" <tr>\n",
|
657 |
+
" <th>13</th>\n",
|
658 |
+
" <td>0.20</td>\n",
|
659 |
+
" <td>0.351318</td>\n",
|
660 |
+
" </tr>\n",
|
661 |
+
" <tr>\n",
|
662 |
+
" <th>14</th>\n",
|
663 |
+
" <td>0.40</td>\n",
|
664 |
+
" <td>0.343750</td>\n",
|
665 |
+
" </tr>\n",
|
666 |
+
" <tr>\n",
|
667 |
+
" <th>15</th>\n",
|
668 |
+
" <td>0.20</td>\n",
|
669 |
+
" <td>0.238159</td>\n",
|
670 |
+
" </tr>\n",
|
671 |
+
" <tr>\n",
|
672 |
+
" <th>16</th>\n",
|
673 |
+
" <td>0.75</td>\n",
|
674 |
+
" <td>0.878418</td>\n",
|
675 |
+
" </tr>\n",
|
676 |
+
" <tr>\n",
|
677 |
+
" <th>17</th>\n",
|
678 |
+
" <td>0.30</td>\n",
|
679 |
+
" <td>0.204712</td>\n",
|
680 |
+
" </tr>\n",
|
681 |
+
" <tr>\n",
|
682 |
+
" <th>18</th>\n",
|
683 |
+
" <td>0.00</td>\n",
|
684 |
+
" <td>0.107849</td>\n",
|
685 |
+
" </tr>\n",
|
686 |
+
" <tr>\n",
|
687 |
+
" <th>19</th>\n",
|
688 |
+
" <td>0.00</td>\n",
|
689 |
+
" <td>0.024857</td>\n",
|
690 |
+
" </tr>\n",
|
691 |
+
" </tbody>\n",
|
692 |
+
"</table>\n",
|
693 |
+
"</div>"
|
694 |
+
],
|
695 |
+
"text/plain": [
|
696 |
+
" a b\n",
|
697 |
+
"0 0.85 0.862305\n",
|
698 |
+
"1 0.40 0.289795\n",
|
699 |
+
"2 0.80 0.911621\n",
|
700 |
+
"3 0.85 0.868164\n",
|
701 |
+
"4 0.70 0.879883\n",
|
702 |
+
"5 0.30 0.218262\n",
|
703 |
+
"6 0.75 0.910645\n",
|
704 |
+
"7 0.85 0.890137\n",
|
705 |
+
"8 0.70 0.417480\n",
|
706 |
+
"9 0.90 0.898438\n",
|
707 |
+
"10 0.70 0.893555\n",
|
708 |
+
"11 0.20 0.142578\n",
|
709 |
+
"12 0.90 0.891602\n",
|
710 |
+
"13 0.20 0.351318\n",
|
711 |
+
"14 0.40 0.343750\n",
|
712 |
+
"15 0.20 0.238159\n",
|
713 |
+
"16 0.75 0.878418\n",
|
714 |
+
"17 0.30 0.204712\n",
|
715 |
+
"18 0.00 0.107849\n",
|
716 |
+
"19 0.00 0.024857"
|
717 |
+
]
|
718 |
+
},
|
719 |
+
"execution_count": 11,
|
720 |
+
"metadata": {},
|
721 |
+
"output_type": "execute_result"
|
722 |
+
}
|
723 |
+
],
|
724 |
+
"source": [
|
725 |
+
"# Using MAE to calculate loss\n",
|
726 |
+
"def get_mae(preds, real):\n",
|
727 |
+
" '''\n",
|
728 |
+
" preds, real: array \n",
|
729 |
+
" '''\n",
|
730 |
+
"\n",
|
731 |
+
" mae = np.mean(np.abs(preds - real))\n",
|
732 |
+
" return mae\n",
|
733 |
+
"\n",
|
734 |
+
"real = np.array(tokenized_ds_test['labels'])\n",
|
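+
"# Optional sketch: the regression head is unbounded, so predictions could be clipped\n",
|
+
"# to the label range before scoring, e.g. preds = preds.clip(0, 1).\n",
|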
735 |
+
"\n",
|
736 |
+
"print(f\"MAE: {get_mae(preds, real)}\")\n",
|
737 |
+
"\n",
|
738 |
+
"# Print predictions on test side-by-side\n",
|
739 |
+
"m = pd.DataFrame({'a':real.reshape(20,), 'b':preds.reshape(20)})\n",
|
740 |
+
"m"
|
741 |
+
]
|
742 |
+
},
|
743 |
+
{
|
744 |
+
"cell_type": "code",
|
745 |
+
"execution_count": 12,
|
746 |
+
"metadata": {},
|
747 |
+
"outputs": [],
|
748 |
+
"source": [
|
749 |
+
"# MAE of my model: 0.1 (Based on test set)"
|
750 |
+
]
|
751 |
+
},
|
752 |
+
{
|
753 |
+
"cell_type": "markdown",
|
754 |
+
"metadata": {},
|
755 |
+
"source": [
|
756 |
+
"# Check if your GPU is available"
|
757 |
+
]
|
758 |
+
},
|
759 |
+
{
|
760 |
+
"cell_type": "code",
|
761 |
+
"execution_count": 5,
|
762 |
+
"metadata": {},
|
763 |
+
"outputs": [
|
764 |
+
{
|
765 |
+
"data": {
|
766 |
+
"text/plain": [
|
767 |
+
"True"
|
768 |
+
]
|
769 |
+
},
|
770 |
+
"execution_count": 5,
|
771 |
+
"metadata": {},
|
772 |
+
"output_type": "execute_result"
|
773 |
+
}
|
774 |
+
],
|
775 |
+
"source": [
|
776 |
+
"import torch\n",
|
777 |
+
"torch.cuda.is_available()"
|
778 |
+
]
|
779 |
+
},
|
780 |
+
{
|
781 |
+
"cell_type": "markdown",
|
782 |
+
"metadata": {},
|
783 |
+
"source": [
|
784 |
+
"# Try Exporting the model"
|
785 |
+
]
|
786 |
+
},
|
787 |
+
{
|
788 |
+
"cell_type": "markdown",
|
789 |
+
"metadata": {},
|
790 |
+
"source": [
|
791 |
+
"#### How to pass input to the model for inference"
|
792 |
+
]
|
793 |
+
},
|
794 |
+
{
|
795 |
+
"cell_type": "code",
|
796 |
+
"execution_count": null,
|
797 |
+
"metadata": {},
|
798 |
+
"outputs": [
|
799 |
+
{
|
800 |
+
"name": "stdout",
|
801 |
+
"output_type": "stream",
|
802 |
+
"text": [
|
803 |
+
"SequenceClassifierOutput(loss={'logits': tensor([[0.6899]], device='cuda:0')}, logits=tensor([[0.6899]], device='cuda:0'), hidden_states=None, attentions=None)\n"
|
804 |
+
]
|
805 |
+
}
|
806 |
+
],
|
807 |
+
"source": [
|
808 |
+
"import torch\n",
|
809 |
+
"\n",
|
810 |
+
"# Use GPU if available, otherwise fall back to CPU\n",
|
811 |
+
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
|
812 |
+
"\n",
|
813 |
+
"# Move the model to the same device\n",
|
814 |
+
"my_model.to(device)\n",
|
815 |
+
"\n",
|
816 |
+
"# Tokenize input and ensure tensors are returned\n",
|
817 |
+
"sentence = \"Hey, it's Geetansh\"\n",
|
818 |
+
"output = tokz(sentence, return_tensors='pt')\n",
|
819 |
+
"\n",
|
820 |
+
"# Move input tensors to the same device as the model\n",
|
821 |
+
"output = {key: val.to(device) for key, val in output.items()}\n",
|
822 |
+
"# print(output)\n",
|
823 |
+
"\n",
|
824 |
+
"# Set model to evaluation mode\n",
|
825 |
+
"my_model.eval()\n",
|
826 |
+
"\n",
|
827 |
+
"# Perform inference without tracking gradients\n",
|
828 |
+
"with torch.no_grad():\n",
|
829 |
+
" # Pass tokenized input to the model\n",
|
830 |
+
" predictions = my_model(**output)\n",
|
831 |
+
"\n",
|
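+
"# The raw score can be read back as a plain float via predictions.logits.item();\n",
|
+
"# a small helper along these lines follows in the next cell.\n",
|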
832 |
+
"# Print predictions\n",
|
833 |
+
"print(predictions)\n"
|
834 |
+
]
|
835 |
+
},
|
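+
{
|
+
"cell_type": "code",
|
+
"execution_count": null,
|
+
"metadata": {},
|
+
"outputs": [],
|
+
"source": [
|
+
"# A minimal helper sketch (added for illustration, not part of the original run):\n",
|
+
"# wraps the inference steps above and returns the score as a plain Python float.\n",
|
+
"# Assumes tokz, my_model, and device from the cells above.\n",
|
+
"def score_sentence(sentence):\n",
|
+
"    enc = tokz(sentence, return_tensors='pt')\n",
|
+
"    enc = {key: val.to(device) for key, val in enc.items()}\n",
|
+
"    my_model.eval()\n",
|
+
"    with torch.no_grad():\n",
|
+
"        out = my_model(**enc)\n",
|
+
"    return out.logits.item()\n",
|
+
"\n",
|
+
"# score_sentence('Introduction to Quantum Mechanics')  # should score close to 1\n"
|
+
]
|
+
},
|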
836 |
+
{
|
837 |
+
"cell_type": "markdown",
|
838 |
+
"metadata": {},
|
839 |
+
"source": [
|
840 |
+
"### Method 1"
|
841 |
+
]
|
842 |
+
},
|
843 |
+
{
|
844 |
+
"cell_type": "code",
|
845 |
+
"execution_count": null,
|
846 |
+
"metadata": {},
|
847 |
+
"outputs": [
|
848 |
+
{
|
849 |
+
"name": "stdout",
|
850 |
+
"output_type": "stream",
|
851 |
+
"text": [
|
852 |
+
"SequenceClassifierOutput(loss=None, logits=tensor([[0.3520]], device='cuda:0'), hidden_states=None, attentions=None)\n"
|
853 |
+
]
|
854 |
+
}
|
855 |
+
],
|
856 |
+
"source": [
|
857 |
+
"# Save the model and tokeniser to disk\n",
|
858 |
+
"save_dir = \"./saved_model\"\n",
|
859 |
+
"# tokz.save_pretrained(save_directory=save_dir)\n",
|
860 |
+
"# my_model.save_pretrained(save_directory=save_dir)\n",
|
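+
"# save_pretrained writes the config, weights and tokenizer files (config.json,\n",
|
+
"# model.safetensors, tokenizer.json, ...) that from_pretrained reloads below.\n",
|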
861 |
+
"\n",
|
862 |
+
"# Use GPU if available, otherwise fall back to CPU\n",
|
863 |
+
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
|
864 |
+
"\n",
|
865 |
+
"# Load the saved model and tokeniser from the disk \n",
|
866 |
+
"from transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
|
867 |
+
"loaded_tokeniser = AutoTokenizer.from_pretrained(save_dir)\n",
|
868 |
+
"loaded_model = AutoModelForSequenceClassification.from_pretrained(save_dir)\n",
|
869 |
+
"\n",
|
870 |
+
"loaded_model.to(device)\n",
|
871 |
+
"\n",
|
872 |
+
"# Test with the dummy input\n",
|
873 |
+
"# Create a dummy input (same structure as your tokenizer output)\n",
|
874 |
+
"dummy_input = loaded_tokeniser(\"This is a test sentence.\", return_tensors='pt')\n",
|
875 |
+
"dummy_input = {key: val.to(device) for key, val in dummy_input.items()}\n",
|
876 |
+
"\n",
|
877 |
+
"with torch.no_grad():\n",
|
878 |
+
" output = loaded_model(**dummy_input)\n",
|
879 |
+
"print(output) "
|
880 |
+
]
|
881 |
+
},
|
882 |
+
{
|
883 |
+
"cell_type": "markdown",
|
884 |
+
"metadata": {},
|
885 |
+
"source": [
|
886 |
+
"### Method 2"
|
887 |
+
]
|
888 |
+
},
|
889 |
+
{
|
890 |
+
"cell_type": "code",
|
891 |
+
"execution_count": null,
|
892 |
+
"metadata": {},
|
893 |
+
"outputs": [
|
894 |
+
{
|
895 |
+
"name": "stdout",
|
896 |
+
"output_type": "stream",
|
897 |
+
"text": [
|
898 |
+
"SequenceClassifierOutput(loss=None, logits=tensor([[0.3520]], device='cuda:0'), hidden_states=None, attentions=None)\n"
|
899 |
+
]
|
900 |
+
}
|
901 |
+
],
|
902 |
+
"source": [
|
903 |
+
"# Save the model and tokeniser to disk\n",
|
904 |
+
"save_dir = \"./saved_model2\"\n",
|
905 |
+
"# trainer.save_model(save_dir)\n",
|
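+
"# Note: trainer.save_model also saves the tokenizer files when a tokenizer was\n",
|
+
"# passed to the Trainer, so AutoTokenizer can load from this directory too.\n",
|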
906 |
+
"\n",
|
907 |
+
"# Use GPU if available, otherwise fall back to CPU\n",
|
908 |
+
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
|
909 |
+
"\n",
|
910 |
+
"# Load the saved model and tokeniser from the disk \n",
|
911 |
+
"from transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
|
912 |
+
"loaded_tokeniser = AutoTokenizer.from_pretrained(save_dir)\n",
|
913 |
+
"loaded_model = AutoModelForSequenceClassification.from_pretrained(save_dir)\n",
|
914 |
+
"\n",
|
915 |
+
"loaded_model.to(device)\n",
|
916 |
+
"\n",
|
917 |
+
"# Test with the same dummy input as before\n",
|
918 |
+
"# Create a dummy input (same structure as your tokenizer output)\n",
|
919 |
+
"dummy_input = loaded_tokeniser(\"This is a test sentence.\", return_tensors='pt')\n",
|
920 |
+
"dummy_input = {key: val.to(device) for key, val in dummy_input.items()}\n",
|
921 |
+
"\n",
|
922 |
+
"with torch.no_grad():\n",
|
923 |
+
" output = loaded_model(**dummy_input)\n",
|
924 |
+
"print(output) "
|
925 |
+
]
|
926 |
+
}
|
927 |
+
],
|
928 |
+
"metadata": {
|
929 |
+
"kernelspec": {
|
930 |
+
"display_name": "venv",
|
931 |
+
"language": "python",
|
932 |
+
"name": "python3"
|
933 |
+
},
|
934 |
+
"language_info": {
|
935 |
+
"codemirror_mode": {
|
936 |
+
"name": "ipython",
|
937 |
+
"version": 3
|
938 |
+
},
|
939 |
+
"file_extension": ".py",
|
940 |
+
"mimetype": "text/x-python",
|
941 |
+
"name": "python",
|
942 |
+
"nbconvert_exporter": "python",
|
943 |
+
"pygments_lexer": "ipython3",
|
944 |
+
"version": "3.12.6"
|
945 |
+
}
|
946 |
+
},
|
947 |
+
"nbformat": 4,
|
948 |
+
"nbformat_minor": 2
|
949 |
+
}
|