basantcuraj committed on
Commit b9b1934 · verified · 1 Parent(s): 541d2e5

Upload Lesson_2_Student.ipynb

Files changed (1)
  1. Lesson_2_Student.ipynb +353 -0
Lesson_2_Student.ipynb ADDED
@@ -0,0 +1,353 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "6ef57eb5-2295-44fb-81d9-4351769e8f4e",
+ "metadata": {},
+ "source": [
+ "# L2: Normalizing the Content"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1138a09c-e3b2-46e0-9d42-e46a93ac534f",
+ "metadata": {},
+ "source": [
+ "<p style=\"background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px\"> ⏳ <b>Note <code>(Kernel Starting)</code>:</b> This notebook takes about 30 seconds to be ready to use. You may start and watch the video while you wait.</p>\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4ec2430d-cac6-406e-a007-123dc97df0e2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Warning control\n",
+ "import warnings\n",
+ "warnings.filterwarnings('ignore')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "591f213d-f54b-4c91-ae73-bcd2c7a4359e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from IPython.display import JSON\n",
+ "\n",
+ "import json\n",
+ "\n",
+ "from unstructured_client import UnstructuredClient\n",
+ "from unstructured_client.models import shared\n",
+ "from unstructured_client.models.errors import SDKError\n",
+ "\n",
+ "from unstructured.partition.html import partition_html\n",
+ "from unstructured.partition.pptx import partition_pptx\n",
+ "from unstructured.staging.base import dict_to_elements, elements_to_json"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b97be736-2166-400c-9d17-6dcb4034c9eb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from Utils import Utils\n",
+ "utils = Utils()\n",
+ "\n",
+ "DLAI_API_KEY = utils.get_dlai_api_key()\n",
+ "DLAI_API_URL = utils.get_dlai_url()\n",
+ "\n",
+ "s = UnstructuredClient(\n",
+ "    api_key_auth=DLAI_API_KEY,\n",
+ "    server_url=DLAI_API_URL,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "be8a0cbd-1545-4dde-ab60-84dcdaa4a51a",
+ "metadata": {},
+ "source": [
+ "<p style=\"background-color:#fff6ff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px\"> 💻 &nbsp; <b>Access Utils File and Helper Functions:</b> To access helper functions and other related files for this notebook, 1) click on the <em>\"View\"</em> option on the top menu of the notebook and then 2) click on <em>\"File Browser\"</em>. For more help, please see the <em>\"Appendix - Tips and Help\"</em> Lesson.</p>\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee01e76f-3773-4611-9ad8-5f990ab23eef",
+ "metadata": {},
+ "source": [
+ "## Example Document: Medium Blog HTML Page"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8753e8a4-859c-41eb-8b5e-101a30ed5b2d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from IPython.display import Image\n",
+ "Image(filename=\"images/HTML_demo.png\", height=600, width=600)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0265c17e-d9db-4877-a804-74d94596a72e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "filename = \"example_files/medium_blog.html\"\n",
+ "elements = partition_html(filename=filename)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "68f55cc9-0d04-4260-a617-9f3ab0c6951f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "element_dict = [el.to_dict() for el in elements]\n",
+ "example_output = json.dumps(element_dict[11:15], indent=2)\n",
+ "print(example_output)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "41aa4434-7ff6-45c5-bf7d-ef870358e4bb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "JSON(example_output)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5e571077-9c60-499c-9668-c0e350148ce8",
+ "metadata": {},
+ "source": [
+ "## Example Document: MSFT PowerPoint on OpenAI"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "36692869-ceeb-4eee-88bf-bdbebb198c70",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "Image(filename=\"images/pptx_slide.png\", height=600, width=600)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "60fffa73-2f75-447a-8596-e24cc3288039",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "filename = \"example_files/msft_openai.pptx\"\n",
+ "elements = partition_pptx(filename=filename)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3589bae1-3f4e-4b9e-b16c-8975e37287c9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "element_dict = [el.to_dict() for el in elements]\n",
+ "JSON(json.dumps(element_dict[:], indent=2))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b5ecce31-081c-4e51-89ae-06cc8d9eaf2d",
+ "metadata": {},
+ "source": [
+ "## Example Document: PDF on Chain-of-Thought"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "037167ed-21bf-4af3-84a4-21f4b88a0b1c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "Image(filename=\"images/cot_paper.png\", height=600, width=600)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dcab4d73-f062-47ff-bc43-43b50753be6d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "filename = \"example_files/CoT.pdf\"\n",
+ "with open(filename, \"rb\") as f:\n",
+ "    files=shared.Files(\n",
+ "        content=f.read(),\n",
+ "        file_name=filename,\n",
+ "    )\n",
+ "\n",
+ "req = shared.PartitionParameters(\n",
+ "    files=files,\n",
+ "    strategy='hi_res',\n",
+ "    pdf_infer_table_structure=True,\n",
+ "    languages=[\"eng\"],\n",
+ ")\n",
+ "try:\n",
+ "    resp = s.general.partition(req)\n",
+ "    print(json.dumps(resp.elements[:3], indent=2))\n",
+ "except SDKError as e:\n",
+ "    print(e)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7c44f3c8-cd94-440b-a9cc-01db643f44d9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "JSON(json.dumps(resp.elements, indent=2))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f9e21878-24e1-4967-b5da-c3e0cdf8a69a",
+ "metadata": {},
+ "source": [
+ "## Work With Your Own Files"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "66704bc0-e59e-438e-9fef-2a0a7cffc376",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import panel as pn\n",
+ "#import param\n",
+ "from Utils import upld_file\n",
+ "pn.extension()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d8b8e99a-4d73-4d24-87a5-55fb8783867c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "upld_widget = upld_file()\n",
+ "pn.Row(upld_widget.widget_file_upload)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2a04e4fa-6b55-4ff0-ad8b-508e81829b60",
+ "metadata": {},
+ "source": [
+ "<p style=\"background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px\"> 🖥 &nbsp; <b>Note:</b> If the file upload interface isn't working properly, the issue may be related to your browser version. In that case, update your browser to the latest version or try a different browser.</p>\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "daaf85f4-f702-467c-b44b-80dcb98aae2c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!ls ./example_files"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "691324b6-6291-4668-8be1-fb9044bb5b6d",
+ "metadata": {},
+ "source": [
+ "<p style=\"background-color:#fff6ff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px\"> 💻 &nbsp; <b>Uploading Your Own File - Method 2:</b> To upload your own files, you can also 1) click on the <em>\"View\"</em> option on the top menu of the notebook and then 2) click on <em>\"File Browser\"</em>. Then 3) click on the <em>\"Upload\"</em> button to upload your files. For more help, please see the <em>\"Appendix - Tips and Help\"</em> Lesson.</p>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bac37496-46de-4a6c-a2c2-b4749722da64",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ba1415fc-e652-4ecc-94ac-0867d101c919",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "11073a73-f21d-4e1e-94a7-1f4a57cbc95f",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a20d475a-1185-4185-9f57-794e1536e4c9",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "afd68abd-967a-4ee1-bd12-0309e595cae4",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2eee2322-6829-4d3a-9666-6b6440c7d074",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
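
The notebook in this commit converts HTML, PPTX, and PDF files into a common list of element dictionaries (via `el.to_dict()` or the API response). As a rough sketch of why that normalized shape is convenient, the snippet below groups element text by its `type` field using only the standard library. The sample records, their ids, and the helper name `group_text_by_type` are hypothetical and not part of the notebook; only the `type`, `element_id`, `text`, and `metadata` keys are assumed to match what the notebook prints.

```python
from collections import defaultdict

# Hypothetical records shaped like the dictionaries the notebook prints;
# the ids and texts are illustrative, not taken from the example files.
sample_elements = [
    {"type": "Title", "element_id": "a1", "text": "Overview",
     "metadata": {"filetype": "text/html"}},
    {"type": "NarrativeText", "element_id": "b2", "text": "First paragraph.",
     "metadata": {"filetype": "text/html"}},
    {"type": "Title", "element_id": "c3", "text": "Details",
     "metadata": {"filetype": "text/html"}},
]


def group_text_by_type(element_dicts):
    """Map each element type to the list of texts of that type, in order."""
    grouped = defaultdict(list)
    for el in element_dicts:
        grouped[el["type"]].append(el["text"])
    return dict(grouped)


grouped = group_text_by_type(sample_elements)
print(grouped["Title"])
```

Because every source format is reduced to the same dictionary shape, the same helper could in principle be applied to `element_dict` from the HTML or PPTX cells, or to `resp.elements` from the PDF cell, without format-specific branches.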