Ilayda-j committed on
Commit
10de3c1
1 Parent(s): 04d1ce0

Upload 27 files

LICENSE ADDED
@@ -0,0 +1,201 @@
+                                  Apache License
+                            Version 2.0, January 2004
+                         http://www.apache.org/licenses/
+
+    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+    1. Definitions.
+
+       "License" shall mean the terms and conditions for use, reproduction,
+       and distribution as defined by Sections 1 through 9 of this document.
+
+       "Licensor" shall mean the copyright owner or entity authorized by
+       the copyright owner that is granting the License.
+
+       "Legal Entity" shall mean the union of the acting entity and all
+       other entities that control, are controlled by, or are under common
+       control with that entity. For the purposes of this definition,
+       "control" means (i) the power, direct or indirect, to cause the
+       direction or management of such entity, whether by contract or
+       otherwise, or (ii) ownership of fifty percent (50%) or more of the
+       outstanding shares, or (iii) beneficial ownership of such entity.
+
+       "You" (or "Your") shall mean an individual or Legal Entity
+       exercising permissions granted by this License.
+
+       "Source" form shall mean the preferred form for making modifications,
+       including but not limited to software source code, documentation
+       source, and configuration files.
+
+       "Object" form shall mean any form resulting from mechanical
+       transformation or translation of a Source form, including but
+       not limited to compiled object code, generated documentation,
+       and conversions to other media types.
+
+       "Work" shall mean the work of authorship, whether in Source or
+       Object form, made available under the License, as indicated by a
+       copyright notice that is included in or attached to the work
+       (an example is provided in the Appendix below).
+
+       "Derivative Works" shall mean any work, whether in Source or Object
+       form, that is based on (or derived from) the Work and for which the
+       editorial revisions, annotations, elaborations, or other modifications
+       represent, as a whole, an original work of authorship. For the purposes
+       of this License, Derivative Works shall not include works that remain
+       separable from, or merely link (or bind by name) to the interfaces of,
+       the Work and Derivative Works thereof.
+
+       "Contribution" shall mean any work of authorship, including
+       the original version of the Work and any modifications or additions
+       to that Work or Derivative Works thereof, that is intentionally
+       submitted to Licensor for inclusion in the Work by the copyright owner
+       or by an individual or Legal Entity authorized to submit on behalf of
+       the copyright owner. For the purposes of this definition, "submitted"
+       means any form of electronic, verbal, or written communication sent
+       to the Licensor or its representatives, including but not limited to
+       communication on electronic mailing lists, source code control systems,
+       and issue tracking systems that are managed by, or on behalf of, the
+       Licensor for the purpose of discussing and improving the Work, but
+       excluding communication that is conspicuously marked or otherwise
+       designated in writing by the copyright owner as "Not a Contribution."
+
+       "Contributor" shall mean Licensor and any individual or Legal Entity
+       on behalf of whom a Contribution has been received by Licensor and
+       subsequently incorporated within the Work.
+
+    2. Grant of Copyright License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       copyright license to reproduce, prepare Derivative Works of,
+       publicly display, publicly perform, sublicense, and distribute the
+       Work and such Derivative Works in Source or Object form.
+
+    3. Grant of Patent License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       (except as stated in this section) patent license to make, have made,
+       use, offer to sell, sell, import, and otherwise transfer the Work,
+       where such license applies only to those patent claims licensable
+       by such Contributor that are necessarily infringed by their
+       Contribution(s) alone or by combination of their Contribution(s)
+       with the Work to which such Contribution(s) was submitted. If You
+       institute patent litigation against any entity (including a
+       cross-claim or counterclaim in a lawsuit) alleging that the Work
+       or a Contribution incorporated within the Work constitutes direct
+       or contributory patent infringement, then any patent licenses
+       granted to You under this License for that Work shall terminate
+       as of the date such litigation is filed.
+
+    4. Redistribution. You may reproduce and distribute copies of the
+       Work or Derivative Works thereof in any medium, with or without
+       modifications, and in Source or Object form, provided that You
+       meet the following conditions:
+
+       (a) You must give any other recipients of the Work or
+           Derivative Works a copy of this License; and
+
+       (b) You must cause any modified files to carry prominent notices
+           stating that You changed the files; and
+
+       (c) You must retain, in the Source form of any Derivative Works
+           that You distribute, all copyright, patent, trademark, and
+           attribution notices from the Source form of the Work,
+           excluding those notices that do not pertain to any part of
+           the Derivative Works; and
+
+       (d) If the Work includes a "NOTICE" text file as part of its
+           distribution, then any Derivative Works that You distribute must
+           include a readable copy of the attribution notices contained
+           within such NOTICE file, excluding those notices that do not
+           pertain to any part of the Derivative Works, in at least one
+           of the following places: within a NOTICE text file distributed
+           as part of the Derivative Works; within the Source form or
+           documentation, if provided along with the Derivative Works; or,
+           within a display generated by the Derivative Works, if and
+           wherever such third-party notices normally appear. The contents
+           of the NOTICE file are for informational purposes only and
+           do not modify the License. You may add Your own attribution
+           notices within Derivative Works that You distribute, alongside
+           or as an addendum to the NOTICE text from the Work, provided
+           that such additional attribution notices cannot be construed
+           as modifying the License.
+
+       You may add Your own copyright statement to Your modifications and
+       may provide additional or different license terms and conditions
+       for use, reproduction, or distribution of Your modifications, or
+       for any such Derivative Works as a whole, provided Your use,
+       reproduction, and distribution of the Work otherwise complies with
+       the conditions stated in this License.
+
+    5. Submission of Contributions. Unless You explicitly state otherwise,
+       any Contribution intentionally submitted for inclusion in the Work
+       by You to the Licensor shall be under the terms and conditions of
+       this License, without any additional terms or conditions.
+       Notwithstanding the above, nothing herein shall supersede or modify
+       the terms of any separate license agreement you may have executed
+       with Licensor regarding such Contributions.
+
+    6. Trademarks. This License does not grant permission to use the trade
+       names, trademarks, service marks, or product names of the Licensor,
+       except as required for reasonable and customary use in describing the
+       origin of the Work and reproducing the content of the NOTICE file.
+
+    7. Disclaimer of Warranty. Unless required by applicable law or
+       agreed to in writing, Licensor provides the Work (and each
+       Contributor provides its Contributions) on an "AS IS" BASIS,
+       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+       implied, including, without limitation, any warranties or conditions
+       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+       PARTICULAR PURPOSE. You are solely responsible for determining the
+       appropriateness of using or redistributing the Work and assume any
+       risks associated with Your exercise of permissions under this License.
+
+    8. Limitation of Liability. In no event and under no legal theory,
+       whether in tort (including negligence), contract, or otherwise,
+       unless required by applicable law (such as deliberate and grossly
+       negligent acts) or agreed to in writing, shall any Contributor be
+       liable to You for damages, including any direct, indirect, special,
+       incidental, or consequential damages of any character arising as a
+       result of this License or out of the use or inability to use the
+       Work (including but not limited to damages for loss of goodwill,
+       work stoppage, computer failure or malfunction, or any and all
+       other commercial damages or losses), even if such Contributor
+       has been advised of the possibility of such damages.
+
+    9. Accepting Warranty or Additional Liability. While redistributing
+       the Work or Derivative Works thereof, You may choose to offer,
+       and charge a fee for, acceptance of support, warranty, indemnity,
+       or other liability obligations and/or rights consistent with this
+       License. However, in accepting such obligations, You may act only
+       on Your own behalf and on Your sole responsibility, not on behalf
+       of any other Contributor, and only if You agree to indemnify,
+       defend, and hold each Contributor harmless for any liability
+       incurred by, or claims asserted against, such Contributor by reason
+       of your accepting any such warranty or additional liability.
+
+    END OF TERMS AND CONDITIONS
+
+    APPENDIX: How to apply the Apache License to your work.
+
+       To apply the Apache License to your work, attach the following
+       boilerplate notice, with the fields enclosed by brackets "[]"
+       replaced with your own identifying information. (Don't include
+       the brackets!) The text should be enclosed in the appropriate
+       comment syntax for the file format. We also recommend that a
+       file or class name and description of purpose be included on the
+       same "printed page" as the copyright notice for easier
+       identification within third-party archives.
+
+    Copyright 2022, fastai
+
+    Licensed under the Apache License, Version 2.0 (the "License");
+    you may not use this file except in compliance with the License.
+    You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
MANIFEST.in ADDED
@@ -0,0 +1,5 @@
+ include settings.ini
+ include LICENSE
+ include CONTRIBUTING.md
+ include README.md
+ recursive-exclude * __pycache__
ai_classroom_suite/IOHelperUtilities.py ADDED
@@ -0,0 +1,85 @@
+ # AUTOGENERATED! DO NOT EDIT! File to edit: ../nbs/helper_utilities.ipynb.
+
+ # %% auto 0
+ __all__ = ['check_is_colab', 'MultiFileChooser', 'setup_drives']
+
+ # %% ../nbs/helper_utilities.ipynb 3
+ import ipywidgets as widgets
+ from IPython.display import display, clear_output
+ from functools import partial
+ from ipyfilechooser import FileChooser
+ import os
+
+ # %% ../nbs/helper_utilities.ipynb 4
+ def check_is_colab():
+     """
+     Check if the current environment is Google Colab.
+     """
+     try:
+         import google.colab
+         return True
+     except ImportError:
+         return False
+
+ # %% ../nbs/helper_utilities.ipynb 7
+ class MultiFileChooser:
+     def __init__(self):
+         self.fc = FileChooser('.')
+         self.fc.title = "Use the following file chooser to add each file individually.\n You can remove files by clicking the remove button."
+         self.fc.use_dir_icons = True
+         self.fc.show_only_dirs = False
+         self.selected_files = []
+
+         self.fc.register_callback(self.file_selected)
+
+         self.output = widgets.Output()
+
+     def file_selected(self, chooser):
+         if self.fc.selected is not None and self.fc.selected not in self.selected_files:
+             self.selected_files.append(self.fc.selected)
+             self.update_display()
+
+     def update_display(self):
+         with self.output:
+             clear_output()
+             for this_file in self.selected_files:
+                 remove_button = widgets.Button(description="Remove", tooltip="Remove this file")
+                 # Keyword must match the remove_file parameter name (this_file)
+                 remove_button.on_click(partial(self.remove_file, this_file=this_file))
+                 display(widgets.HBox([widgets.Label(value=this_file), remove_button]))
+
+     def remove_file(self, button, this_file):
+         if this_file in self.selected_files:
+             self.selected_files.remove(this_file)
+             self.update_display()
+
+     def display(self):
+         display(self.fc, self.output)
+
+     def get_selected_files(self):
+         return self.selected_files
+
+ # %% ../nbs/helper_utilities.ipynb 12
+ def setup_drives(upload_set):
+
+     upload_set = upload_set.lower()
+     uploaded = None
+
+     # Allow mounting the drive if they chose Google Drive
+     if upload_set == 'google drive':
+         if check_is_colab():
+             from google.colab import drive
+             drive.mount('/content/drive')
+         else:
+             raise ValueError("It looks like you're not on Google Colab. Google Drive mounting is currently only implemented for Google Colab.")
+
+     # Everything else requires a file chooser (including Google Drive)
+     if check_is_colab():
+         from google.colab import files
+         uploaded = files.upload()
+     else:
+         # Create file chooser and interact
+         mfc = MultiFileChooser()
+         mfc.display()
+         uploaded = mfc.get_selected_files()
+
+     return uploaded
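
For orientation, a minimal usage sketch of this module (hypothetical, not part of the commit; it assumes a Jupyter environment with ipywidgets and ipyfilechooser installed):

    # Hypothetical notebook usage of IOHelperUtilities
    from ai_classroom_suite.IOHelperUtilities import check_is_colab, setup_drives

    print(check_is_colab())  # False outside Google Colab

    # Outside Colab this renders a MultiFileChooser widget; the returned list
    # is the chooser's own list, so it fills in as files are picked interactively.
    uploaded = setup_drives('local files')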
ai_classroom_suite/MediaVectorStores.py ADDED
@@ -0,0 +1,173 @@
+ # AUTOGENERATED! DO NOT EDIT! File to edit: ../nbs/media_stores.ipynb.
+
+ # %% auto 0
+ __all__ = ['rawtext_to_doc_split', 'files_to_text', 'youtube_to_text', 'save_text', 'get_youtube_transcript',
+            'website_to_text_web', 'website_to_text_unstructured', 'get_document_segments', 'create_local_vector_store']
+
+ # %% ../nbs/media_stores.ipynb 3
+ # import libraries here
+ import os
+ import itertools
+
+ from langchain.embeddings import OpenAIEmbeddings
+
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain.document_loaders.unstructured import UnstructuredFileLoader
+ from langchain.document_loaders.generic import GenericLoader
+ from langchain.document_loaders.parsers import OpenAIWhisperParser
+ from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
+ from langchain.document_loaders import WebBaseLoader, UnstructuredURLLoader
+ from langchain.docstore.document import Document
+
+ from langchain.vectorstores import Chroma
+ from langchain.chains import RetrievalQAWithSourcesChain
+
+ # %% ../nbs/media_stores.ipynb 8
+ def rawtext_to_doc_split(text, chunk_size=1500, chunk_overlap=150):
+
+     # Quick type checking
+     if not isinstance(text, list):
+         text = [text]
+
+     # Create splitter
+     text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
+                                                    chunk_overlap=chunk_overlap,
+                                                    add_start_index=True)
+
+     # Split into document segments
+     if isinstance(text[0], Document):
+         doc_segments = text_splitter.split_documents(text)
+     else:
+         doc_segments = text_splitter.split_documents(text_splitter.create_documents(text))
+
+     # Flatten into one list
+     doc_segments = list(itertools.chain(*doc_segments)) if isinstance(doc_segments[0], list) else doc_segments
+
+     return doc_segments
+
+ # %% ../nbs/media_stores.ipynb 16
+ ## A single file
+ def _file_to_text(single_file, chunk_size=1000, chunk_overlap=150):
+
+     # Create loader and get segments
+     loader = UnstructuredFileLoader(single_file)
+     doc_segments = loader.load_and_split(RecursiveCharacterTextSplitter(chunk_size=chunk_size,
+                                                                         chunk_overlap=chunk_overlap,
+                                                                         add_start_index=True))
+     return doc_segments
+
+
+ ## Multiple files
+ def files_to_text(files_list, chunk_size=1000, chunk_overlap=150):
+
+     # Quick type checking
+     if not isinstance(files_list, list):
+         files_list = [files_list]
+
+     # Workaround: UnstructuredFileLoader accepts a list of files but cannot yet split
+     # them correctly, so load each file individually and flatten the results
+     all_segments = [_file_to_text(single_file, chunk_size=chunk_size, chunk_overlap=chunk_overlap) for single_file in files_list]
+     all_segments = list(itertools.chain(*all_segments)) if isinstance(all_segments[0], list) else all_segments
+
+     return all_segments
+
+ # %% ../nbs/media_stores.ipynb 20
+ def youtube_to_text(urls, save_dir="content"):
+     # Transcribe the videos to text
+     # save_dir: directory to save audio files
+
+     if not isinstance(urls, list):
+         urls = [urls]
+
+     youtube_loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
+     youtube_docs = youtube_loader.load()
+
+     return youtube_docs
+
+ # %% ../nbs/media_stores.ipynb 24
+ def save_text(text, text_name=None):
+     if not text_name:
+         text_name = text[:20]
+     text_path = os.path.join("/content", text_name + ".txt")
+
+     with open(text_path, "x") as f:
+         f.write(text)
+     # Return the location at which the transcript is saved
+     return text_path
+
+ # %% ../nbs/media_stores.ipynb 25
+ def get_youtube_transcript(yt_url, save_transcript=False, temp_audio_dir="sample_data"):
+     # Transcribe the videos to text and save to file in /content
+     # temp_audio_dir: directory to save audio files
+
+     youtube_docs = youtube_to_text(yt_url, save_dir=temp_audio_dir)
+
+     # Combine docs
+     combined_docs = [doc.page_content for doc in youtube_docs]
+     combined_text = " ".join(combined_docs)
+
+     # Save text to file
+     video_path = youtube_docs[0].metadata["source"]
+     youtube_name = os.path.splitext(os.path.basename(video_path))[0]
+
+     save_path = None
+     if save_transcript:
+         save_path = save_text(combined_text, youtube_name)
+
+     return youtube_docs, save_path
+
+ # %% ../nbs/media_stores.ipynb 27
+ def website_to_text_web(url, chunk_size=1500, chunk_overlap=100):
+
+     # url can be a single string or a list
+     website_loader = WebBaseLoader(url)
+     website_raw = website_loader.load()
+
+     website_data = rawtext_to_doc_split(website_raw, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
+
+     # Return the document segments
+     return website_data
+
+ # %% ../nbs/media_stores.ipynb 33
+ def website_to_text_unstructured(web_urls, chunk_size=1500, chunk_overlap=100):
+
+     # Make sure it's a list
+     if not isinstance(web_urls, list):
+         web_urls = [web_urls]
+
+     website_loader = UnstructuredURLLoader(web_urls)
+     website_raw = website_loader.load()
+
+     website_data = rawtext_to_doc_split(website_raw, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
+
+     # Return the list of document segments
+     return website_data
+
+ # %% ../nbs/media_stores.ipynb 45
+ def get_document_segments(context_info, data_type, chunk_size=1500, chunk_overlap=100):
+
+     load_fcn = None
+     addtnl_params = {'chunk_size': chunk_size, 'chunk_overlap': chunk_overlap}
+
+     # Define the function used to do the loading
+     if data_type == 'text':
+         load_fcn = rawtext_to_doc_split
+     elif data_type == 'web_page':
+         load_fcn = website_to_text_unstructured
+     elif data_type == 'youtube_video':
+         load_fcn = youtube_to_text
+         addtnl_params = {}  # youtube_to_text does not accept chunking parameters
+     else:
+         load_fcn = files_to_text
+
+     # Get the document segments
+     doc_segments = load_fcn(context_info, **addtnl_params)
+
+     return doc_segments
+
+ # %% ../nbs/media_stores.ipynb 47
+ def create_local_vector_store(document_segments, **retriever_kwargs):
+     embeddings = OpenAIEmbeddings()
+     db = Chroma.from_documents(document_segments, embeddings)
+     retriever = db.as_retriever(**retriever_kwargs)
+
+     return db, retriever
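
A sketch of how these pieces compose end to end (hypothetical example, not part of the commit; assumes OPENAI_API_KEY is set and the chromadb dependency is installed):

    # Hypothetical usage of MediaVectorStores
    from ai_classroom_suite.MediaVectorStores import get_document_segments, create_local_vector_store

    segments = get_document_segments("Some long course text to study ...", 'text',
                                     chunk_size=700, chunk_overlap=100)
    db, retriever = create_local_vector_store(segments, search_kwargs={"k": 2})
    relevant = retriever.get_relevant_documents("What is the main topic?")

The retriever produced here is what the retrieval-QA tutor chain in PromptInteractionBase consumes.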
ai_classroom_suite/PromptInteractionBase.py ADDED
@@ -0,0 +1,191 @@
+ # AUTOGENERATED! DO NOT EDIT! File to edit: ../nbs/prompt_interaction_base.ipynb.
+
+ # %% auto 0
+ __all__ = ['SYSTEM_TUTOR_TEMPLATE', 'HUMAN_RESPONSE_TEMPLATE', 'HUMAN_RETRIEVER_RESPONSE_TEMPLATE', 'DEFAULT_ASSESSMENT_MSG',
+            'DEFAULT_LEARNING_OBJS_MSG', 'DEFAULT_CONDENSE_PROMPT_TEMPLATE', 'DEFAULT_QUESTION_PROMPT_TEMPLATE',
+            'DEFAULT_COMBINE_PROMPT_TEMPLATE', 'create_model', 'set_openai_key', 'create_base_tutoring_prompt',
+            'get_tutoring_prompt', 'get_tutoring_answer', 'create_tutor_mdl_chain']
+
+ # %% ../nbs/prompt_interaction_base.ipynb 3
+ from langchain.chat_models import ChatOpenAI
+ from langchain.llms import OpenAI
+
+ from langchain.prompts import ChatPromptTemplate, PromptTemplate
+ from langchain.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate
+ from langchain.chains import LLMChain, ConversationalRetrievalChain, RetrievalQAWithSourcesChain
+ from langchain.chains.base import Chain
+
+ from getpass import getpass
+
+ import os
+
+ # %% ../nbs/prompt_interaction_base.ipynb 5
+ def create_model(openai_mdl='gpt-3.5-turbo-16k', temperature=0.1, **chatopenai_kwargs):
+     llm = ChatOpenAI(model_name=openai_mdl, temperature=temperature, **chatopenai_kwargs)
+
+     return llm
+
+ # %% ../nbs/prompt_interaction_base.ipynb 6
+ def set_openai_key():
+     openai_api_key = getpass()
+     os.environ["OPENAI_API_KEY"] = openai_api_key
+
+     return
+
+ # %% ../nbs/prompt_interaction_base.ipynb 10
+ # Create system prompt template
+ SYSTEM_TUTOR_TEMPLATE = ("You are a world-class tutor helping students to perform better on oral and written exams through interactive experiences. " +
+                          "When assessing and evaluating students, you always ask one question at a time, and wait for the student's response before " +
+                          "providing them with feedback. Asking one question at a time, waiting for the student's response, and then commenting " +
+                          "on the strengths and weaknesses of their responses (when appropriate) is what makes you such a sought-after, world-class tutor.")
+
+ # Create a human response template
+ HUMAN_RESPONSE_TEMPLATE = ("I'm trying to better understand the text provided below. {assessment_request} The learning objectives to be assessed are: " +
+                            "{learning_objectives}. Although I may request more than one assessment question, you should " +
+                            "only provide ONE question in your initial response. Do not include the answer in your response. " +
+                            "If I get an answer wrong, provide me with an explanation of why it was incorrect, and then give me additional " +
+                            "chances to respond until I get the correct choice. Explain why the correct choice is right. " +
+                            "The text that you will base your questions on is as follows: {context}.")
+
+ HUMAN_RETRIEVER_RESPONSE_TEMPLATE = ("I want to master the topics based on the excerpts of the text below. Given the following extracted text from long documents, {assessment_request} The learning objectives to be assessed are: " +
+                                      "{learning_objectives}. Although I may request more than one assessment question, you should " +
+                                      "only provide ONE question in your initial response. Do not include the answer in your response. " +
+                                      "If I get an answer wrong, provide me with an explanation of why it was incorrect, and then give me additional " +
+                                      "chances to respond until I get the correct choice. Explain why the correct choice is right. " +
+                                      "The extracted text from long documents is as follows: {summaries}.")
+
+ def create_base_tutoring_prompt(system_prompt=None, human_prompt=None):
+
+     # Set up defaults using the defined values
+     if system_prompt is None:
+         system_prompt = PromptTemplate(template=SYSTEM_TUTOR_TEMPLATE,
+                                        input_variables=[])
+
+     if human_prompt is None:
+         human_prompt = PromptTemplate(template=HUMAN_RESPONSE_TEMPLATE,
+                                       input_variables=['assessment_request', 'learning_objectives', 'context'])
+
+     # Create prompt messages
+     system_tutor_msg = SystemMessagePromptTemplate(prompt=system_prompt)
+     human_tutor_msg = HumanMessagePromptTemplate(prompt=human_prompt)
+
+     # Create ChatPromptTemplate
+     chat_prompt = ChatPromptTemplate.from_messages([system_tutor_msg, human_tutor_msg])
+
+     return chat_prompt
+
+ # %% ../nbs/prompt_interaction_base.ipynb 14
+ DEFAULT_ASSESSMENT_MSG = 'Please design a 5 question short answer quiz about the provided text.'
+ DEFAULT_LEARNING_OBJS_MSG = 'Identify and comprehend the important topics and underlying messages and connections within the text'
+
+ def get_tutoring_prompt(context, chat_template=None, assessment_request=None, learning_objectives=None, **kwargs):
+
+     # set defaults
+     if chat_template is None:
+         chat_template = create_base_tutoring_prompt()
+     else:
+         if not all([prompt_var in chat_template.input_variables
+                     for prompt_var in ['context', 'assessment_request', 'learning_objectives']]):
+             raise KeyError('''It looks like you may have a custom chat_template. Either include context, assessment_request, and learning_objectives
+                            as input variables or create your own tutoring prompt.''')
+
+     if assessment_request is None:
+         assessment_request = DEFAULT_ASSESSMENT_MSG
+
+     if learning_objectives is None:
+         learning_objectives = DEFAULT_LEARNING_OBJS_MSG
+
+     # compose final prompt
+     tutoring_prompt = chat_template.format_prompt(context=context,
+                                                   assessment_request=assessment_request,
+                                                   learning_objectives=learning_objectives,
+                                                   **kwargs)
+
+     return tutoring_prompt
+
+
+ # %% ../nbs/prompt_interaction_base.ipynb 18
+ def get_tutoring_answer(context, tutor_mdl, chat_template=None, assessment_request=None, learning_objectives=None, return_dict=False, call_kwargs={}, input_kwargs={}):
+
+     # Get an answer from the chat model or chain
+
+     # set defaults
+     if assessment_request is None:
+         assessment_request = DEFAULT_ASSESSMENT_MSG
+     if learning_objectives is None:
+         learning_objectives = DEFAULT_LEARNING_OBJS_MSG
+
+     common_inputs = {'assessment_request': assessment_request, 'learning_objectives': learning_objectives}
+
+     # get answer based on interaction type
+     if isinstance(tutor_mdl, ChatOpenAI):
+         human_ask_prompt = get_tutoring_prompt(context, chat_template, assessment_request, learning_objectives)
+         tutor_answer = tutor_mdl(human_ask_prompt.to_messages())
+         final_answer = tutor_answer
+
+         if not return_dict:
+             final_answer = tutor_answer.content
+
+     elif isinstance(tutor_mdl, Chain):
+         if isinstance(tutor_mdl, RetrievalQAWithSourcesChain):
+             if 'question' not in input_kwargs.keys():
+                 common_inputs['question'] = assessment_request
+             final_inputs = {**common_inputs, **input_kwargs}
+         else:
+             common_inputs['context'] = context
+             final_inputs = {**common_inputs, **input_kwargs}
+
+         # get answer
+         tutor_answer = tutor_mdl(final_inputs, **call_kwargs)
+         final_answer = tutor_answer
+
+         if not return_dict:
+             final_answer = final_answer['answer']
+
+     else:
+         raise NotImplementedError(f"tutor_mdl of type {type(tutor_mdl)} is not supported.")
+
+     return final_answer
+
+ # %% ../nbs/prompt_interaction_base.ipynb 19
+ DEFAULT_CONDENSE_PROMPT_TEMPLATE = ("Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, " +
+                                     "in its original language.\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:")
+
+ DEFAULT_QUESTION_PROMPT_TEMPLATE = ("Use the following portion of a long document to see if any of the text is relevant to creating a response to the question." +
+                                     "\nReturn any relevant text verbatim.\n{context}\nQuestion: {question}\nRelevant text, if any:")
+
+ DEFAULT_COMBINE_PROMPT_TEMPLATE = ("Given the following extracted parts of a long document and the given prompt, create a final answer with references ('SOURCES'). " +
+                                    "If you don't have a response, just say that you are unable to come up with a response. " +
+                                    "\nSOURCES:\n\nQUESTION: {question}\n=========\n{summaries}\n=========\nFINAL ANSWER:")
+
+ def create_tutor_mdl_chain(kind='llm', mdl=None, prompt_template=None, **kwargs):
+
+     # Validate parameters
+     if mdl is None:
+         mdl = create_model()
+     kind = kind.lower()
+
+     # Create model chain
+     if kind == 'llm':
+         if prompt_template is None:
+             prompt_template = create_base_tutoring_prompt()
+         mdl_chain = LLMChain(llm=mdl, prompt=prompt_template, **kwargs)
+     elif kind == 'conversational':
+         if prompt_template is None:
+             prompt_template = PromptTemplate.from_template(DEFAULT_CONDENSE_PROMPT_TEMPLATE)
+         mdl_chain = ConversationalRetrievalChain.from_llm(mdl, condense_question_prompt=prompt_template, **kwargs)
+     elif kind == 'retrieval_qa':
+         if prompt_template is None:
+
+             # Create custom human prompt to take in summaries
+             human_prompt = PromptTemplate(template=HUMAN_RETRIEVER_RESPONSE_TEMPLATE,
+                                           input_variables=['assessment_request', 'learning_objectives', 'summaries'])
+             prompt_template = create_base_tutoring_prompt(human_prompt=human_prompt)
+
+         # Create the combination prompt and model
+         question_template = PromptTemplate.from_template(DEFAULT_QUESTION_PROMPT_TEMPLATE)
+         mdl_chain = RetrievalQAWithSourcesChain.from_llm(llm=mdl, question_prompt=question_template, combine_prompt=prompt_template, **kwargs)
+     else:
+         raise NotImplementedError(f"Model kind {kind} not implemented")
+
+     return mdl_chain
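
A short usage sketch for the two main entry points (hypothetical, not part of the commit; requires a valid OpenAI key):

    # Hypothetical usage of PromptInteractionBase
    from ai_classroom_suite.PromptInteractionBase import set_openai_key, create_model, get_tutoring_answer

    set_openai_key()      # interactively prompts for the OpenAI API key
    mdl = create_model()  # ChatOpenAI, 'gpt-3.5-turbo-16k' by default

    # With a bare chat model, the context text is inlined into the tutoring prompt
    reply = get_tutoring_answer("Photosynthesis converts light energy into chemical energy ...", mdl)
    print(reply)

For document collections too large to inline, create_tutor_mdl_chain(kind='retrieval_qa', retriever=...) pairs the same prompts with a vector-store retriever instead.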
ai_classroom_suite/SelfStudyPrompts.py ADDED
@@ -0,0 +1,75 @@
+ # AUTOGENERATED! DO NOT EDIT! File to edit: ../nbs/self_study_prompts.ipynb.
+
+ # %% auto 0
+ __all__ = ['MC_QUIZ_DEFAULT', 'SHORT_ANSWER_DEFAULT', 'FILL_BLANK_DEFAULT', 'SEQUENCING_DEFAULT', 'RELATIONSHIP_DEFAULT',
+            'CONCEPTS_DEFAULT', 'REAL_WORLD_EXAMPLE_DEFAULT', 'RANDOMIZED_QUESTIONS_DEFAULT', 'SELF_STUDY_PROMPT_NAMES',
+            'SELF_STUDY_DEFAULTS', 'list_all_self_study_prompt_keys', 'list_all_self_study_prompts',
+            'list_default_self_prompt_varnames', 'print_all_self_study_prompts']
+
+ # %% ../nbs/self_study_prompts.ipynb 4
+ # used for pretty display
+ import pandas as pd
+ from IPython.display import display
+
+ # %% ../nbs/self_study_prompts.ipynb 5
+ MC_QUIZ_DEFAULT = "Please design a 5 question multiple choice quiz about the provided text."
+
+ SHORT_ANSWER_DEFAULT = ("Please design a 5 question short answer quiz about the provided text. "
+                         "The question types should be short answer. Expect the correct answers to be a few sentences long.")
+
+ FILL_BLANK_DEFAULT = """Create a 5 question fill in the blank quiz referencing parts of the provided text.
+ The "blank" part of the question should appear as "________". The answers should reflect what word(s) should go in the blank to make an accurate statement.
+ An example is as follows: "The author of the book is ______." The question should be a statement.
+ """
+
+ SEQUENCING_DEFAULT = """Create a 5 question questionnaire that will ask me to recall the steps or sequence of events
+ in the provided text."""
+
+ RELATIONSHIP_DEFAULT = ("Create a 5 question quiz for the student that asks the student to identify relationships between "
+                         "topics or concepts that are important to understanding this text.")
+
+ CONCEPTS_DEFAULT = """Design a 5 question quiz that asks me about definitions or concepts of importance in the provided text."""
+
+ REAL_WORLD_EXAMPLE_DEFAULT = """Demonstrate how the provided context can be applied to solve a real world problem.
+ Ask me questions about how the demonstration you provided relates to solving a real world problem."""
+
+ RANDOMIZED_QUESTIONS_DEFAULT = """Generate a high-quality assessment consisting of 5 varied questions,
+ each of different types (open-ended, multiple choice, short answer, analogies, etc.)"""
+
+ SELF_STUDY_PROMPT_NAMES = ['MC_QUIZ_DEFAULT',
+                            'SHORT_ANSWER_DEFAULT',
+                            'FILL_BLANK_DEFAULT',
+                            'SEQUENCING_DEFAULT',
+                            'RELATIONSHIP_DEFAULT',
+                            'CONCEPTS_DEFAULT',
+                            'REAL_WORLD_EXAMPLE_DEFAULT',
+                            'RANDOMIZED_QUESTIONS_DEFAULT']
+
+ # %% ../nbs/self_study_prompts.ipynb 7
+ # Define self study dictionary for lookup
+ SELF_STUDY_DEFAULTS = {'mc': MC_QUIZ_DEFAULT,
+                        'short_answer': SHORT_ANSWER_DEFAULT,
+                        'fill_blank': FILL_BLANK_DEFAULT,
+                        'sequencing': SEQUENCING_DEFAULT,
+                        'relationships': RELATIONSHIP_DEFAULT,
+                        'concepts': CONCEPTS_DEFAULT,
+                        'real_world_example': REAL_WORLD_EXAMPLE_DEFAULT,
+                        'randomized_questions': RANDOMIZED_QUESTIONS_DEFAULT
+                        }
+
+ # Return list of all self study prompt keys
+ def list_all_self_study_prompt_keys():
+     return list(SELF_STUDY_DEFAULTS.keys())
+
+ # Return list of all self study prompts
+ def list_all_self_study_prompts():
+     return list(SELF_STUDY_DEFAULTS.values())
+
+ # Return list of all self study variable names
+ def list_default_self_prompt_varnames():
+     return SELF_STUDY_PROMPT_NAMES
+
+ # Print as a table
+ def print_all_self_study_prompts():
+     with pd.option_context('max_colwidth', None):
+         display(pd.DataFrame({'SELF_STUDY_DEFAULTS key': list(SELF_STUDY_DEFAULTS.keys()),
+                               'Prompt': list(SELF_STUDY_DEFAULTS.values())}))
+
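
These defaults are intended to be passed as the assessment_request of the tutoring prompts; a quick lookup sketch (hypothetical, not part of the commit):

    # Hypothetical usage of SelfStudyPrompts
    from ai_classroom_suite.SelfStudyPrompts import SELF_STUDY_DEFAULTS, list_all_self_study_prompt_keys

    print(list_all_self_study_prompt_keys())  # ['mc', 'short_answer', 'fill_blank', ...]
    assessment_request = SELF_STUDY_DEFAULTS['fill_blank']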
ai_classroom_suite/__init__.py ADDED
@@ -0,0 +1 @@
+ __version__ = "0.0.1"
ai_classroom_suite/_modidx.py ADDED
@@ -0,0 +1,93 @@
+ # Autogenerated by nbdev
+
+ d = { 'settings': { 'branch': 'main',
+                     'doc_baseurl': '/lo-achievement',
+                     'doc_host': 'https://vanderbilt-data-science.github.io',
+                     'git_url': 'https://github.com/vanderbilt-data-science/lo-achievement',
+                     'lib_path': 'ai_classroom_suite'},
+   'syms': { 'ai_classroom_suite.IOHelperUtilities': {
+                 'ai_classroom_suite.IOHelperUtilities.MultiFileChooser': ('helper_utilities.html#multifilechooser', 'ai_classroom_suite/IOHelperUtilities.py'),
+                 'ai_classroom_suite.IOHelperUtilities.MultiFileChooser.__init__': ('helper_utilities.html#multifilechooser.__init__', 'ai_classroom_suite/IOHelperUtilities.py'),
+                 'ai_classroom_suite.IOHelperUtilities.MultiFileChooser.display': ('helper_utilities.html#multifilechooser.display', 'ai_classroom_suite/IOHelperUtilities.py'),
+                 'ai_classroom_suite.IOHelperUtilities.MultiFileChooser.file_selected': ('helper_utilities.html#multifilechooser.file_selected', 'ai_classroom_suite/IOHelperUtilities.py'),
+                 'ai_classroom_suite.IOHelperUtilities.MultiFileChooser.get_selected_files': ('helper_utilities.html#multifilechooser.get_selected_files', 'ai_classroom_suite/IOHelperUtilities.py'),
+                 'ai_classroom_suite.IOHelperUtilities.MultiFileChooser.remove_file': ('helper_utilities.html#multifilechooser.remove_file', 'ai_classroom_suite/IOHelperUtilities.py'),
+                 'ai_classroom_suite.IOHelperUtilities.MultiFileChooser.update_display': ('helper_utilities.html#multifilechooser.update_display', 'ai_classroom_suite/IOHelperUtilities.py'),
+                 'ai_classroom_suite.IOHelperUtilities.check_is_colab': ('helper_utilities.html#check_is_colab', 'ai_classroom_suite/IOHelperUtilities.py'),
+                 'ai_classroom_suite.IOHelperUtilities.setup_drives': ('helper_utilities.html#setup_drives', 'ai_classroom_suite/IOHelperUtilities.py')},
+             'ai_classroom_suite.MediaVectorStores': {
+                 'ai_classroom_suite.MediaVectorStores._file_to_text': ('media_stores.html#_file_to_text', 'ai_classroom_suite/MediaVectorStores.py'),
+                 'ai_classroom_suite.MediaVectorStores.create_local_vector_store': ('media_stores.html#create_local_vector_store', 'ai_classroom_suite/MediaVectorStores.py'),
+                 'ai_classroom_suite.MediaVectorStores.files_to_text': ('media_stores.html#files_to_text', 'ai_classroom_suite/MediaVectorStores.py'),
+                 'ai_classroom_suite.MediaVectorStores.get_document_segments': ('media_stores.html#get_document_segments', 'ai_classroom_suite/MediaVectorStores.py'),
+                 'ai_classroom_suite.MediaVectorStores.get_youtube_transcript': ('media_stores.html#get_youtube_transcript', 'ai_classroom_suite/MediaVectorStores.py'),
+                 'ai_classroom_suite.MediaVectorStores.rawtext_to_doc_split': ('media_stores.html#rawtext_to_doc_split', 'ai_classroom_suite/MediaVectorStores.py'),
+                 'ai_classroom_suite.MediaVectorStores.save_text': ('media_stores.html#save_text', 'ai_classroom_suite/MediaVectorStores.py'),
+                 'ai_classroom_suite.MediaVectorStores.website_to_text_unstructured': ('media_stores.html#website_to_text_unstructured', 'ai_classroom_suite/MediaVectorStores.py'),
+                 'ai_classroom_suite.MediaVectorStores.website_to_text_web': ('media_stores.html#website_to_text_web', 'ai_classroom_suite/MediaVectorStores.py'),
+                 'ai_classroom_suite.MediaVectorStores.youtube_to_text': ('media_stores.html#youtube_to_text', 'ai_classroom_suite/MediaVectorStores.py')},
+             'ai_classroom_suite.PromptInteractionBase': {
+                 'ai_classroom_suite.PromptInteractionBase.create_base_tutoring_prompt': ('prompt_interaction_base.html#create_base_tutoring_prompt', 'ai_classroom_suite/PromptInteractionBase.py'),
+                 'ai_classroom_suite.PromptInteractionBase.create_model': ('prompt_interaction_base.html#create_model', 'ai_classroom_suite/PromptInteractionBase.py'),
+                 'ai_classroom_suite.PromptInteractionBase.create_tutor_mdl_chain': ('prompt_interaction_base.html#create_tutor_mdl_chain', 'ai_classroom_suite/PromptInteractionBase.py'),
+                 'ai_classroom_suite.PromptInteractionBase.get_tutoring_answer': ('prompt_interaction_base.html#get_tutoring_answer', 'ai_classroom_suite/PromptInteractionBase.py'),
+                 'ai_classroom_suite.PromptInteractionBase.get_tutoring_prompt': ('prompt_interaction_base.html#get_tutoring_prompt', 'ai_classroom_suite/PromptInteractionBase.py'),
+                 'ai_classroom_suite.PromptInteractionBase.set_openai_key': ('prompt_interaction_base.html#set_openai_key', 'ai_classroom_suite/PromptInteractionBase.py')},
+             'ai_classroom_suite.SelfStudyPrompts': {
+                 'ai_classroom_suite.SelfStudyPrompts.list_all_self_study_prompt_keys': ('self_study_prompts.html#list_all_self_study_prompt_keys', 'ai_classroom_suite/SelfStudyPrompts.py'),
+                 'ai_classroom_suite.SelfStudyPrompts.list_all_self_study_prompts': ('self_study_prompts.html#list_all_self_study_prompts', 'ai_classroom_suite/SelfStudyPrompts.py'),
+                 'ai_classroom_suite.SelfStudyPrompts.list_default_self_prompt_varnames': ('self_study_prompts.html#list_default_self_prompt_varnames', 'ai_classroom_suite/SelfStudyPrompts.py'),
+                 'ai_classroom_suite.SelfStudyPrompts.print_all_self_study_prompts': ('self_study_prompts.html#print_all_self_study_prompts', 'ai_classroom_suite/SelfStudyPrompts.py')},
+             'ai_classroom_suite.self_study_app': {
+                 'ai_classroom_suite.self_study_app.SlightlyDelusionalTutor': ('gradio_application.html#slightlydelusionaltutor', 'ai_classroom_suite/self_study_app.py'),
+                 'ai_classroom_suite.self_study_app.SlightlyDelusionalTutor.__init__': ('gradio_application.html#slightlydelusionaltutor.__init__', 'ai_classroom_suite/self_study_app.py'),
+                 'ai_classroom_suite.self_study_app.SlightlyDelusionalTutor.add_user_message': ('gradio_application.html#slightlydelusionaltutor.add_user_message', 'ai_classroom_suite/self_study_app.py'),
+                 'ai_classroom_suite.self_study_app.SlightlyDelusionalTutor.forget_conversation': ('gradio_application.html#slightlydelusionaltutor.forget_conversation', 'ai_classroom_suite/self_study_app.py'),
+                 'ai_classroom_suite.self_study_app.SlightlyDelusionalTutor.get_sources_memory': ('gradio_application.html#slightlydelusionaltutor.get_sources_memory', 'ai_classroom_suite/self_study_app.py'),
+                 'ai_classroom_suite.self_study_app.SlightlyDelusionalTutor.get_tutor_reply': ('gradio_application.html#slightlydelusionaltutor.get_tutor_reply', 'ai_classroom_suite/self_study_app.py'),
+                 'ai_classroom_suite.self_study_app.SlightlyDelusionalTutor.initialize_llm': ('gradio_application.html#slightlydelusionaltutor.initialize_llm', 'ai_classroom_suite/self_study_app.py'),
+                 'ai_classroom_suite.self_study_app.add_user_message': ('gradio_application.html#add_user_message', 'ai_classroom_suite/self_study_app.py'),
+                 'ai_classroom_suite.self_study_app.create_reference_store': ('gradio_application.html#create_reference_store', 'ai_classroom_suite/self_study_app.py'),
+                 'ai_classroom_suite.self_study_app.disable_until_done': ('gradio_application.html#disable_until_done', 'ai_classroom_suite/self_study_app.py'),
+                 'ai_classroom_suite.self_study_app.embed_key': ('gradio_application.html#embed_key', 'ai_classroom_suite/self_study_app.py'),
+                 'ai_classroom_suite.self_study_app.get_tutor_reply': ('gradio_application.html#get_tutor_reply', 'ai_classroom_suite/self_study_app.py'),
+                 'ai_classroom_suite.self_study_app.prompt_select': ('gradio_application.html#prompt_select', 'ai_classroom_suite/self_study_app.py'),
+                 'ai_classroom_suite.self_study_app.save_chatbot_dialogue': ('gradio_application.html#save_chatbot_dialogue', 'ai_classroom_suite/self_study_app.py')}}}
ai_classroom_suite/self_study_app.py ADDED
@@ -0,0 +1,358 @@
1
+ # AUTOGENERATED! DO NOT EDIT! File to edit: ../nbs/gradio_application.ipynb.
2
+
3
+ # %% auto 0
4
+ __all__ = ['save_pdf', 'save_json', 'save_txt', 'save_csv', 'num_sources', 'css', 'save_chatbot_dialogue',
5
+ 'SlightlyDelusionalTutor', 'embed_key', 'create_reference_store', 'prompt_select', 'add_user_message',
6
+ 'get_tutor_reply', 'disable_until_done']
7
+
8
+ # %% ../nbs/gradio_application.ipynb 9
9
+ import gradio as gr
10
+ from functools import partial
11
+ import pandas as pd
12
+ import os
13
+
14
+ from .PromptInteractionBase import *
15
+ from .IOHelperUtilities import *
16
+ from .SelfStudyPrompts import *
17
+ from .MediaVectorStores import *
18
+
19
+ # %% ../nbs/gradio_application.ipynb 13
20
+ def save_chatbot_dialogue(chat_tutor, save_type):
21
+
22
+ formatted_convo = pd.DataFrame(chat_tutor.conversation_memory, columns=['user', 'chatbot'])
23
+
24
+ output_fname = f'tutoring_conversation.{save_type}'
25
+
26
+ if save_type == 'csv':
27
+ formatted_convo.to_csv(output_fname, index=False)
28
+ elif save_type == 'json':
29
+ formatted_convo.to_json(output_fname, orient='records')
30
+ elif save_type == 'txt':
31
+ temp = formatted_convo.apply(lambda x: 'User: {0}\nAI: {1}'.format(x[0], x[1]), axis=1)
32
+ temp = '\n\n'.join(temp.tolist())
33
+ with open(output_fname, 'w') as f:
34
+ f.write(temp)
35
+ else:
36
+ gr.update(value=None, visible=False)
37
+
38
+ return gr.update(value=output_fname, visible=True)
39
+
40
+ save_pdf = partial(save_chatbot_dialogue, save_type='pdf')
41
+ save_json = partial(save_chatbot_dialogue, save_type='json')
42
+ save_txt = partial(save_chatbot_dialogue, save_type='txt')
43
+ save_csv = partial(save_chatbot_dialogue, save_type='csv')
44
+
45
+
46
+ # %% ../nbs/gradio_application.ipynb 16
47
+ class SlightlyDelusionalTutor:
48
+ # create basic initialization function
49
+ def __init__(self, model_name = None):
50
+
51
+ # create default model name
52
+ if model_name is None:
53
+ self.model_name = 'gpt-3.5-turbo-16k'
54
+
55
+ self.chat_llm = None
56
+ self.tutor_chain = None
57
+ self.vector_store = None
58
+ self.vs_retriever = None
59
+ self.conversation_memory = []
60
+ self.sources_memory = []
61
+ self.flattened_conversation = ''
62
+ self.api_key_valid = False
63
+ self.learning_objectives = None
64
+ self.openai_auth = ''
65
+
66
+ def initialize_llm(self):
67
+
68
+ if self.openai_auth:
69
+ try:
70
+ self.chat_llm = create_model(self.model_name, openai_api_key = self.openai_auth)
71
+ self.api_key_valid = True
72
+ except Exception as e:
73
+ print(e)
74
+ self.api_key_valid = False
75
+ else:
76
+ print("Please provide an OpenAI API key and press Enter.")
77
+
78
+ def add_user_message(self, user_message):
79
+ self.conversation_memory.append([user_message, None])
80
+ self.flattened_conversation = self.flattened_conversation + '\n\n' + 'User: ' + user_message
81
+
82
+ def get_tutor_reply(self, **input_kwargs):
83
+
84
+ if not self.conversation_memory:
85
+ return "Please type something to start the conversation."
86
+
87
+ # we want to have a different vector comparison for reference lookup after the topic is first used
88
+ if len(self.conversation_memory) > 1:
89
+ if 'question' in input_kwargs.keys():
90
+ if input_kwargs['question']:
91
+ input_kwargs['question'] = self.conversation_memory[-1][0] + ' keeping in mind I want to learn about ' + input_kwargs['question']
92
+ else:
93
+ input_kwargs['question'] = self.conversation_memory[-1][0]
94
+
95
+ # get tutor message
96
+ tutor_message = get_tutoring_answer(None,
97
+ self.tutor_chain,
98
+ assessment_request = self.flattened_conversation + 'First, please provide your feedback on my previous answer if I was answering a question, otherwise, respond appropriately to my statement. Then, help me with the following:' + self.conversation_memory[-1][0],
99
+ learning_objectives = self.learning_objectives,
100
+ return_dict=True,
101
+ **input_kwargs)
102
+
103
+ # add tutor message to conversation memory
104
+ self.conversation_memory[-1][1] = tutor_message['answer']
105
+ self.flattened_conversation = self.flattened_conversation + '\nAI: ' + tutor_message['answer']
106
+ self.sources_memory.append(tutor_message['source_documents'])
107
+ #print(self.flattened_conversation, '\n\n')
108
+ print(tutor_message['source_documents'])
109
+
110
+ def get_sources_memory(self):
111
+ # retrieve last source
112
+ last_sources = self.sources_memory[-1]
113
+
114
+ # get page_content keyword from last_sources
115
+ doc_contents = ['Source ' + str(ind+1) + '\n"' + doc.page_content + '"\n\n' for ind, doc in enumerate(last_sources)]
116
+ doc_contents = ''.join(doc_contents)
117
+
118
+ return doc_contents
119
+
120
+ def forget_conversation(self):
121
+ self.conversation_memory = []
122
+ self.sources_memory = []
123
+ self.flattened_conversation = ''
124
+
125
+ # %% ../nbs/gradio_application.ipynb 18
126
+ def embed_key(openai_api_key, chat_tutor):
127
+ if not openai_api_key:
128
+ return chat_tutor
129
+
130
+ # Otherwise, update key
131
+ os.environ["OPENAI_API_KEY"] = openai_api_key
132
+
133
+ #update tutor
134
+ chat_tutor.openai_auth = openai_api_key
135
+
136
+ if not chat_tutor.api_key_valid:
137
+ chat_tutor.initialize_llm()
138
+
139
+ return chat_tutor
140
+
141
+ # %% ../nbs/gradio_application.ipynb 20
142
+ def create_reference_store(chat_tutor, vs_button, text_cp, upload_files, reference_vs, openai_auth, learning_objs):
143
+
144
+ text_segs = []
145
+ upload_segs = []
146
+
147
+ if reference_vs:
148
+ raise NotImplementedError("Reference Vector Stores are not yet implemented")
149
+
150
+ if text_cp.strip():
151
+ text_segs = get_document_segments(text_cp, 'text', chunk_size=700, chunk_overlap=100)
152
+ [doc.metadata.update({'source':'text box'}) for doc in text_segs];
153
+
154
+ if upload_files:
155
+ print(upload_files)
156
+ upload_fnames = [f.name for f in upload_files]
157
+ upload_segs = get_document_segments(upload_fnames, 'file', chunk_size=700, chunk_overlap=100)
158
+
159
+ # get the full list of everything
160
+ all_segs = text_segs + upload_segs
161
+ print(all_segs)
162
+
163
+ # create the vector store and update tutor
164
+ vs_db, vs_retriever = create_local_vector_store(all_segs, search_kwargs={"k": 2})
165
+ chat_tutor.vector_store = vs_db
166
+ chat_tutor.vs_retriever = vs_retriever
167
+
168
+ # create the tutor chain
169
+ if not chat_tutor.api_key_valid or not chat_tutor.openai_auth:
170
+ chat_tutor = embed_key(openai_auth, chat_tutor)
171
+ qa_chain = create_tutor_mdl_chain(kind="retrieval_qa", mdl=chat_tutor.chat_llm, retriever = chat_tutor.vs_retriever, return_source_documents=True)
172
+ chat_tutor.tutor_chain = qa_chain
173
+
174
+ # store learning objectives
175
+ chat_tutor.learning_objectives = learning_objs
176
+
177
+ # return the story
178
+ return chat_tutor, gr.update(interactive=True, value='Tutor Initialized!')
179
+
180
+ # %% ../nbs/gradio_application.ipynb 22
181
+ ### Gradio Called Functions ###
182
+
183
+ def prompt_select(selection, number, length):
184
+ if selection == "Random":
185
+ prompt = f"Please design a {number} question quiz based on the context provided and the inputted learning objectives (if applicable). The types of questions should be randomized (including multiple choice, short answer, true/false, short answer, etc.). Provide one question at a time, and wait for my response before providing me with feedback. Again, while the quiz may ask for multiple questions, you should only provide 1 question in you initial response. Do not include the answer in your response. If I get an answer wrong, provide me with an explanation of why it was incorrect, and then give me additional chances to respond until I get the correct choice. Explain why the correct choice is right."
186
+ elif selection == "Fill in the Blank":
187
+ prompt = f"Create a {number} question fill in the blank quiz refrencing the context provided. The quiz should reflect the learning objectives (if inputted). The 'blank' part of the question should appear as '________'. The answers should reflect what word(s) should go in the blank an accurate statement. An example is the follow: 'The author of the article is ______.' The question should be a statement. Provide one question at a time, and wait for my response before providing me with feedback. Again, while the quiz may ask for multiple questions, you should only provide ONE question in you initial response. Do not include the answer in your response. If I get an answer wrong, provide me with an explanation of why it was incorrect,and then give me additional chances to respond until I get the correct choice. Explain why the correct choice is right."
188
+ elif selection == "Short Answer":
189
+ prompt = f"Please design a {number} question quiz about which reflects the learning objectives (if inputted). The questions should be short answer. Expect the correct answers to be {length} sentences long. Provide one question at a time, and wait for my response before providing me with feedback. Again, while the quiz may ask for multiple questions, you should only provide ONE question in you initial response. Do not include the answer in your response. If I get an answer wrong, provide me with an explanation of why it was incorrect, and then give me additional chances to respond until I get the correct choice. Explain why the correct answer is right."
190
+ else:
191
+ prompt = f"Please design a {number} question {selection.lower()} quiz based on the context provided and the inputted learning objectives (if applicable). Provide one question at a time, and wait for my response before providing me with feedback. Again, while the quiz may ask for multiple questions, you should only provide 1 question in you initial response. Do not include the answer in your response. If I get an answer wrong, provide me with an explanation of why it was incorrect, and then give me additional chances to respond until I get the correct choice. Explain why the correct choice is right."
192
+ return prompt, prompt
193
+
194
+
195
+ # %% ../nbs/gradio_application.ipynb 24
196
+ ### Chatbot Functions ###
197
+
198
+ def add_user_message(user_message, chat_tutor):
199
+ """Display user message and update chat history to include it.
200
+ Also disables user text input until bot is finished (call to reenable_chat())
201
+ See https://gradio.app/creating-a-chatbot/"""
202
+ chat_tutor.add_user_message(user_message)
203
+ return gr.update(value="", interactive=False), chat_tutor.conversation_memory, chat_tutor
204
+
205
+ def get_tutor_reply(learning_topic, chat_tutor):
206
+ chat_tutor.get_tutor_reply(input_kwargs={'question':learning_topic})
207
+ return gr.update(value="", interactive=True), gr.update(visible=True, value=chat_tutor.get_sources_memory()), chat_tutor.conversation_memory, chat_tutor
208
+
209
+ num_sources = 2
210
+
211
+ # %% ../nbs/gradio_application.ipynb 25
212
+ def disable_until_done(obj_in):
213
+ return gr.update(interactive=False)
214
+
215
+ # %% ../nbs/gradio_application.ipynb 27
216
+ # See https://gradio.app/custom-CSS-and-JS/
217
+ css="""
218
+ #sources-container {
219
+ overflow: scroll !important; /* Needs to override default formatting */
220
+ /*max-height: 20em; */ /* Arbitrary value */
221
+ }
222
+ #sources-container > div { padding-bottom: 1em !important; /* Arbitrary value */ }
223
+ .short-height > * > * { min-height: 0 !important; }
224
+ .translucent { opacity: 0.5; }
225
+ .textbox_label { padding-bottom: .5em; }
226
+ """
227
+ #srcs = [] # Reset sources (db and qa are kept the same for ease of testing)
228
+
229
+ with gr.Blocks(css=css, analytics_enabled=False) as demo:
230
+
231
+ #initialize tutor (with state)
232
+ study_tutor = gr.State(SlightlyDelusionalTutor())
233
+
234
+ # Title
235
+ gr.Markdown("# Studying with a Slightly Delusional Tutor")
236
+
237
+ # API Authentication functionality
238
+ with gr.Box():
239
+ gr.Markdown("### OpenAI API Key ")
240
+ gr.HTML("""<span>Embed your OpenAI API key below; if you haven't created one already, visit
241
+ <a href="https://platform.openai.com/account/api-keys">platform.openai.com/account/api-keys</a>
242
+ to sign up for an account and get your personal API key</span>""",
243
+ elem_classes="textbox_label")
244
+ api_input = gr.Textbox(show_label=False, type="password", container=False, autofocus=True,
245
+ placeholder="●●●●●●●●●●●●●●●●●", value='')
246
+ api_input.submit(fn=embed_key, inputs=[api_input, study_tutor], outputs=study_tutor)
247
+ api_input.blur(fn=embed_key, inputs=[api_input, study_tutor], outputs=study_tutor)
248
+
249
+ # Reference document functionality (building vector stores)
250
+ with gr.Box():
251
+ gr.Markdown("### Add Reference Documents")
252
+ # TODO Add entry for path to vector store (should be disabled for now)
253
+ with gr.Row(equal_height=True):
254
+ text_input = gr.TextArea(label='Copy and paste your text below',
255
+ lines=2)
256
+
257
+ file_input = gr.Files(label="Load a .txt or .pdf file",
258
+ file_types=['.pdf', '.txt'], type="file",
259
+ elem_classes="short-height")
260
+
261
+ instructor_input = gr.TextArea(label='Enter vector store URL, if given by instructor (WIP)', value='',
262
+ lines=2, interactive=False, elem_classes="translucent")
263
+
264
+ # Adding the learning objectives
265
+ with gr.Box():
266
+ gr.Markdown("### Optional: Enter Your Learning Objectives")
267
+ learning_objectives = gr.Textbox(label='If provided by your instructor, please input your learning objectives for this session', value='')
268
+
269
+ # Adding the button to submit all of the settings and create the Chat Tutor Chain.
270
+ with gr.Row():
271
+ vs_build_button = gr.Button(value = 'Start Studying with Your Tutor!', scale=1)
272
+ vs_build_button.click(disable_until_done, vs_build_button, vs_build_button) \
273
+ .then(create_reference_store, [study_tutor, vs_build_button, text_input, file_input, instructor_input, api_input, learning_objectives],
274
+ [study_tutor, vs_build_button])
275
+
276
+
277
+
278
+ # Premade question prompts
279
+ with gr.Box():
280
+ gr.Markdown("""
281
+ ## Generate a Premade Prompt
282
+ Select the type and number of questions you want, then click "Generate Prompt" to get your premade prompt;
283
+ it will also be inserted into the chat input below. \
284
+ You can also copy the prompt using the icon in the upper right corner and paste directly into the input box when interacting with the model.
285
+ """)
286
+ with gr.Row():
287
+ with gr.Column():
288
+ question_type = gr.Dropdown(["Multiple Choice", "True or False", "Short Answer", "Fill in the Blank", "Random"], label="Question Type")
289
+ number_of_questions = gr.Textbox(label="Enter desired number of questions")
290
+ sa_desired_length = gr.Dropdown(["1-2", "3-4", "5-6", "6 or more"], label = "For short answer questions only, choose the desired sentence length for answers. The default value is 1-2 sentences.")
291
+ with gr.Column():
292
+ prompt_button = gr.Button("Generate Prompt")
293
+ premade_prompt_output = gr.Textbox(label="Generated prompt (save or copy)", show_copy_button=True)
294
+
295
+
296
+ # Chatbot interface
297
+ gr.Markdown("## Chat with the Model")
298
+ topic_input = gr.Textbox(label="What topic or concept are you trying to learn more about?")
299
+ with gr.Row(equal_height=True):
300
+ with gr.Column(scale=2):
301
+ chatbot = gr.Chatbot()
302
+ with gr.Row():
303
+ user_chat_input = gr.Textbox(label="User input", scale=9)
304
+ user_chat_submit = gr.Button("Ask/answer model", scale=1)
305
+
306
+ # sources
307
+ with gr.Box(elem_id="sources-container", scale=1):
308
+ # TODO: Display document sources in a nicer format?
309
+ gr.HTML(value="<h3 id='sources'>Referenced Sources</h3>")
310
+ sources_output = gr.Textbox(value='', interactive=False, visible=False, show_label=False)
311
+ #sources_output = []
312
+ #for i in range(num_sources):
313
+ # source_elem = gr.HTML(visible=False)
314
+ # sources_output.append(source_elem)
315
+
316
+ #define the behavior of prompt button later since it depends on user_chat_input
317
+ prompt_button.click(prompt_select,
318
+ inputs=[question_type, number_of_questions, sa_desired_length],
319
+ outputs=[premade_prompt_output, user_chat_input])
320
+
321
+ # Display input and output in three-ish parts
322
+ # (using asynchronous functions):
323
+ # First show user input, then show model output when complete
324
+ # Then wait until the bot provides response and return the result
325
+ # Finally, allow the user to ask a new question by reenabling input
326
+ async_response = user_chat_submit.click(add_user_message,
327
+ [user_chat_input, study_tutor],
328
+ [user_chat_input, chatbot, study_tutor], queue=False) \
329
+ .then(get_tutor_reply, [topic_input, study_tutor], [user_chat_input, sources_output, chatbot, study_tutor], queue=True)
330
+
331
+ async_response_b = user_chat_input.submit(add_user_message,
332
+ [user_chat_input, study_tutor],
333
+ [user_chat_input, chatbot, study_tutor], queue=False) \
334
+ .then(get_tutor_reply, [topic_input, study_tutor], [user_chat_input, sources_output, chatbot, study_tutor], queue=True)
335
+
336
+ with gr.Blocks():
337
+ gr.Markdown("""
338
+ ## Export Your Chat History
339
+ Export your chat history as a .json, .pdf, .txt, or .csv file
340
+ """)
341
+ with gr.Row():
342
+ export_dialogue_button_json = gr.Button("JSON")
343
+ export_dialogue_button_pdf = gr.Button("PDF")
344
+ export_dialogue_button_txt = gr.Button("TXT")
345
+ export_dialogue_button_csv = gr.Button("CSV")
346
+
347
+ file_download = gr.Files(label="Download here",
348
+ file_types=['.pdf', '.txt', '.csv', '.json'], type="file", visible=False)
349
+
350
+ export_dialogue_button_json.click(save_json, study_tutor, file_download, show_progress=True)
351
+ export_dialogue_button_pdf.click(save_pdf, study_tutor, file_download, show_progress=True)
352
+ export_dialogue_button_txt.click(save_txt, study_tutor, file_download, show_progress=True)
353
+ export_dialogue_button_csv.click(save_csv, study_tutor, file_download, show_progress=True)
354
+
355
+ demo.queue()
356
+ demo.launch(debug=True)
357
+ #demo.launch()
358
+ #gr.close_all()
basic_UI_design_oral_exam.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
grading_from_json.ipynb ADDED
@@ -0,0 +1,606 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "colab_type": "text",
7
+ "id": "view-in-github"
8
+ },
9
+ "source": [
10
+ "<a href=\"https://colab.research.google.com/github/vanderbilt-data-science/lo-achievement/blob/main/grading_from_json.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
11
+ ]
12
+ },
13
+ {
14
+ "cell_type": "code",
15
+ "execution_count": null,
16
+ "metadata": {
17
+ "id": "kfO7rE64ZTI_"
18
+ },
19
+ "outputs": [],
20
+ "source": [
21
+ "!pip install openai"
22
+ ]
23
+ },
24
+ {
25
+ "cell_type": "code",
26
+ "execution_count": 2,
27
+ "metadata": {
28
+ "id": "f26sZpe-MCCj"
29
+ },
30
+ "outputs": [],
31
+ "source": [
32
+ "import json\n",
33
+ "import openai\n",
34
+ "import os\n",
35
+ "import pandas as pd"
36
+ ]
37
+ },
38
+ {
39
+ "cell_type": "code",
40
+ "execution_count": 4,
41
+ "metadata": {
42
+ "colab": {
43
+ "base_uri": "https://localhost:8080/",
44
+ "height": 614
45
+ },
46
+ "id": "BVTr_mR0XIJI",
47
+ "outputId": "897e41a0-d5e1-4b5f-d254-0a6e0f6aa3fa"
48
+ },
49
+ "outputs": [
50
+ {
51
+ "data": {
52
+ "text/html": [
53
+ "\n",
54
+ " <div id=\"df-e24b7014-4d98-4fc5-9ff1-07fa5c26ba5e\">\n",
55
+ " <div class=\"colab-df-container\">\n",
56
+ " <div>\n",
57
+ "<style scoped>\n",
58
+ " .dataframe tbody tr th:only-of-type {\n",
59
+ " vertical-align: middle;\n",
60
+ " }\n",
61
+ "\n",
62
+ " .dataframe tbody tr th {\n",
63
+ " vertical-align: top;\n",
64
+ " }\n",
65
+ "\n",
66
+ " .dataframe thead th {\n",
67
+ " text-align: right;\n",
68
+ " }\n",
69
+ "</style>\n",
70
+ "<table border=\"1\" class=\"dataframe\">\n",
71
+ " <thead>\n",
72
+ " <tr style=\"text-align: right;\">\n",
73
+ " <th></th>\n",
74
+ " <th>timestamp</th>\n",
75
+ " <th>author</th>\n",
76
+ " <th>message</th>\n",
77
+ " </tr>\n",
78
+ " </thead>\n",
79
+ " <tbody>\n",
80
+ " <tr>\n",
81
+ " <th>0</th>\n",
82
+ " <td>2023-06-07 08:16:00+00:00</td>\n",
83
+ " <td>assistant</td>\n",
84
+ " <td>Question 1:\\nWhich of the following statements...</td>\n",
85
+ " </tr>\n",
86
+ " <tr>\n",
87
+ " <th>1</th>\n",
88
+ " <td>2023-06-07 08:16:30+00:00</td>\n",
89
+ " <td>user</td>\n",
90
+ " <td>C</td>\n",
91
+ " </tr>\n",
92
+ " <tr>\n",
93
+ " <th>2</th>\n",
94
+ " <td>2023-06-07 08:17:00+00:00</td>\n",
95
+ " <td>assistant</td>\n",
96
+ " <td>Correct! Option C is the correct answer...</td>\n",
97
+ " </tr>\n",
98
+ " <tr>\n",
99
+ " <th>3</th>\n",
100
+ " <td>2023-06-07 08:17:30+00:00</td>\n",
101
+ " <td>assistant</td>\n",
102
+ " <td>Question 2:\\nWhich of the following expenses a...</td>\n",
103
+ " </tr>\n",
104
+ " <tr>\n",
105
+ " <th>4</th>\n",
106
+ " <td>2023-06-07 08:18:00+00:00</td>\n",
107
+ " <td>user</td>\n",
108
+ " <td>A</td>\n",
109
+ " </tr>\n",
110
+ " <tr>\n",
111
+ " <th>5</th>\n",
112
+ " <td>2023-06-07 08:18:30+00:00</td>\n",
113
+ " <td>assistant</td>\n",
114
+ " <td>I'm sorry, but your answer is incorrect...</td>\n",
115
+ " </tr>\n",
116
+ " <tr>\n",
117
+ " <th>6</th>\n",
118
+ " <td>2023-06-07 08:19:00+00:00</td>\n",
119
+ " <td>assistant</td>\n",
120
+ " <td>Question 2 (Revised):\\nWhich of the following ...</td>\n",
121
+ " </tr>\n",
122
+ " <tr>\n",
123
+ " <th>7</th>\n",
124
+ " <td>2023-06-07 08:19:30+00:00</td>\n",
125
+ " <td>user</td>\n",
126
+ " <td>D</td>\n",
127
+ " </tr>\n",
128
+ " <tr>\n",
129
+ " <th>8</th>\n",
130
+ " <td>2023-06-07 08:20:00+00:00</td>\n",
131
+ " <td>assistant</td>\n",
132
+ " <td>Correct! Option D is the correct answer...</td>\n",
133
+ " </tr>\n",
134
+ " <tr>\n",
135
+ " <th>9</th>\n",
136
+ " <td>2023-06-07 08:20:30+00:00</td>\n",
137
+ " <td>assistant</td>\n",
138
+ " <td>Question 3:\\nWhat is the purpose of capitalizi...</td>\n",
139
+ " </tr>\n",
140
+ " <tr>\n",
141
+ " <th>10</th>\n",
142
+ " <td>2023-06-07 08:21:00+00:00</td>\n",
143
+ " <td>user</td>\n",
144
+ " <td>C</td>\n",
145
+ " </tr>\n",
146
+ " <tr>\n",
147
+ " <th>11</th>\n",
148
+ " <td>2023-06-07 08:21:30+00:00</td>\n",
149
+ " <td>assistant</td>\n",
150
+ " <td>Correct! Option C is the correct answer...</td>\n",
151
+ " </tr>\n",
152
+ " <tr>\n",
153
+ " <th>12</th>\n",
154
+ " <td>2023-06-07 08:22:00+00:00</td>\n",
155
+ " <td>assistant</td>\n",
156
+ " <td>Question 4:\\nWhich financial statement provide...</td>\n",
157
+ " </tr>\n",
158
+ " <tr>\n",
159
+ " <th>13</th>\n",
160
+ " <td>2023-06-07 08:22:30+00:00</td>\n",
161
+ " <td>user</td>\n",
162
+ " <td>C</td>\n",
163
+ " </tr>\n",
164
+ " <tr>\n",
165
+ " <th>14</th>\n",
166
+ " <td>2023-06-07 08:23:00+00:00</td>\n",
167
+ " <td>assistant</td>\n",
168
+ " <td>Correct! Option C is the correct answer...</td>\n",
169
+ " </tr>\n",
170
+ " <tr>\n",
171
+ " <th>15</th>\n",
172
+ " <td>2023-06-07 08:23:30+00:00</td>\n",
173
+ " <td>assistant</td>\n",
174
+ " <td>Question 5:\\nWhat is the purpose of the matchi...</td>\n",
175
+ " </tr>\n",
176
+ " <tr>\n",
177
+ " <th>16</th>\n",
178
+ " <td>2023-06-07 08:24:00+00:00</td>\n",
179
+ " <td>user</td>\n",
180
+ " <td>B</td>\n",
181
+ " </tr>\n",
182
+ " <tr>\n",
183
+ " <th>17</th>\n",
184
+ " <td>2023-06-07 08:24:30+00:00</td>\n",
185
+ " <td>assistant</td>\n",
186
+ " <td>Correct! Option B is the correct answer...</td>\n",
187
+ " </tr>\n",
188
+ " </tbody>\n",
189
+ "</table>\n",
190
+ "</div>\n",
191
+ " <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-e24b7014-4d98-4fc5-9ff1-07fa5c26ba5e')\"\n",
192
+ " title=\"Convert this dataframe to an interactive table.\"\n",
193
+ " style=\"display:none;\">\n",
194
+ " \n",
195
+ " <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
196
+ " width=\"24px\">\n",
197
+ " <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
198
+ " <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
199
+ " </svg>\n",
200
+ " </button>\n",
201
+ " \n",
202
+ " <style>\n",
203
+ " .colab-df-container {\n",
204
+ " display:flex;\n",
205
+ " flex-wrap:wrap;\n",
206
+ " gap: 12px;\n",
207
+ " }\n",
208
+ "\n",
209
+ " .colab-df-convert {\n",
210
+ " background-color: #E8F0FE;\n",
211
+ " border: none;\n",
212
+ " border-radius: 50%;\n",
213
+ " cursor: pointer;\n",
214
+ " display: none;\n",
215
+ " fill: #1967D2;\n",
216
+ " height: 32px;\n",
217
+ " padding: 0 0 0 0;\n",
218
+ " width: 32px;\n",
219
+ " }\n",
220
+ "\n",
221
+ " .colab-df-convert:hover {\n",
222
+ " background-color: #E2EBFA;\n",
223
+ " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
224
+ " fill: #174EA6;\n",
225
+ " }\n",
226
+ "\n",
227
+ " [theme=dark] .colab-df-convert {\n",
228
+ " background-color: #3B4455;\n",
229
+ " fill: #D2E3FC;\n",
230
+ " }\n",
231
+ "\n",
232
+ " [theme=dark] .colab-df-convert:hover {\n",
233
+ " background-color: #434B5C;\n",
234
+ " box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
235
+ " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
236
+ " fill: #FFFFFF;\n",
237
+ " }\n",
238
+ " </style>\n",
239
+ "\n",
240
+ " <script>\n",
241
+ " const buttonEl =\n",
242
+ " document.querySelector('#df-e24b7014-4d98-4fc5-9ff1-07fa5c26ba5e button.colab-df-convert');\n",
243
+ " buttonEl.style.display =\n",
244
+ " google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
245
+ "\n",
246
+ " async function convertToInteractive(key) {\n",
247
+ " const element = document.querySelector('#df-e24b7014-4d98-4fc5-9ff1-07fa5c26ba5e');\n",
248
+ " const dataTable =\n",
249
+ " await google.colab.kernel.invokeFunction('convertToInteractive',\n",
250
+ " [key], {});\n",
251
+ " if (!dataTable) return;\n",
252
+ "\n",
253
+ " const docLinkHtml = 'Like what you see? Visit the ' +\n",
254
+ " '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
255
+ " + ' to learn more about interactive tables.';\n",
256
+ " element.innerHTML = '';\n",
257
+ " dataTable['output_type'] = 'display_data';\n",
258
+ " await google.colab.output.renderOutput(dataTable, element);\n",
259
+ " const docLink = document.createElement('div');\n",
260
+ " docLink.innerHTML = docLinkHtml;\n",
261
+ " element.appendChild(docLink);\n",
262
+ " }\n",
263
+ " </script>\n",
264
+ " </div>\n",
265
+ " </div>\n",
266
+ " "
267
+ ],
268
+ "text/plain": [
269
+ " timestamp author \\\n",
270
+ "0 2023-06-07 08:16:00+00:00 assistant \n",
271
+ "1 2023-06-07 08:16:30+00:00 user \n",
272
+ "2 2023-06-07 08:17:00+00:00 assistant \n",
273
+ "3 2023-06-07 08:17:30+00:00 assistant \n",
274
+ "4 2023-06-07 08:18:00+00:00 user \n",
275
+ "5 2023-06-07 08:18:30+00:00 assistant \n",
276
+ "6 2023-06-07 08:19:00+00:00 assistant \n",
277
+ "7 2023-06-07 08:19:30+00:00 user \n",
278
+ "8 2023-06-07 08:20:00+00:00 assistant \n",
279
+ "9 2023-06-07 08:20:30+00:00 assistant \n",
280
+ "10 2023-06-07 08:21:00+00:00 user \n",
281
+ "11 2023-06-07 08:21:30+00:00 assistant \n",
282
+ "12 2023-06-07 08:22:00+00:00 assistant \n",
283
+ "13 2023-06-07 08:22:30+00:00 user \n",
284
+ "14 2023-06-07 08:23:00+00:00 assistant \n",
285
+ "15 2023-06-07 08:23:30+00:00 assistant \n",
286
+ "16 2023-06-07 08:24:00+00:00 user \n",
287
+ "17 2023-06-07 08:24:30+00:00 assistant \n",
288
+ "\n",
289
+ " message \n",
290
+ "0 Question 1:\\nWhich of the following statements... \n",
291
+ "1 C \n",
292
+ "2 Correct! Option C is the correct answer... \n",
293
+ "3 Question 2:\\nWhich of the following expenses a... \n",
294
+ "4 A \n",
295
+ "5 I'm sorry, but your answer is incorrect... \n",
296
+ "6 Question 2 (Revised):\\nWhich of the following ... \n",
297
+ "7 D \n",
298
+ "8 Correct! Option D is the correct answer... \n",
299
+ "9 Question 3:\\nWhat is the purpose of capitalizi... \n",
300
+ "10 C \n",
301
+ "11 Correct! Option C is the correct answer... \n",
302
+ "12 Question 4:\\nWhich financial statement provide... \n",
303
+ "13 C \n",
304
+ "14 Correct! Option C is the correct answer... \n",
305
+ "15 Question 5:\\nWhat is the purpose of the matchi... \n",
306
+ "16 B \n",
307
+ "17 Correct! Option B is the correct answer... "
308
+ ]
309
+ },
310
+ "execution_count": 4,
311
+ "metadata": {},
312
+ "output_type": "execute_result"
313
+ }
314
+ ],
315
+ "source": [
316
+ "df = pd.read_json('demo_json.json')\n",
317
+ "pd.read_json('demo_json.json')"
318
+ ]
319
+ },
320
+ {
321
+ "cell_type": "code",
322
+ "execution_count": 5,
323
+ "metadata": {
324
+ "id": "anSNlvqlXh6i"
325
+ },
326
+ "outputs": [],
327
+ "source": [
328
+ "openai.api_key = \"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\""
329
+ ]
330
+ },
331
+ {
332
+ "cell_type": "code",
333
+ "execution_count": 8,
334
+ "metadata": {
335
+ "colab": {
336
+ "base_uri": "https://localhost:8080/",
337
+ "height": 627
338
+ },
339
+ "id": "udujJrX6SryU",
340
+ "outputId": "9b182162-7c1c-4d5a-be56-16947ddcda33"
341
+ },
342
+ "outputs": [
343
+ {
344
+ "data": {
345
+ "text/html": [
346
+ "\n",
347
+ " <div id=\"df-5123f950-1dca-46a6-be4d-dab5de1f8899\">\n",
348
+ " <div class=\"colab-df-container\">\n",
349
+ " <div>\n",
350
+ "<style scoped>\n",
351
+ " .dataframe tbody tr th:only-of-type {\n",
352
+ " vertical-align: middle;\n",
353
+ " }\n",
354
+ "\n",
355
+ " .dataframe tbody tr th {\n",
356
+ " vertical-align: top;\n",
357
+ " }\n",
358
+ "\n",
359
+ " .dataframe thead th {\n",
360
+ " text-align: right;\n",
361
+ " }\n",
362
+ "</style>\n",
363
+ "<table border=\"1\" class=\"dataframe\">\n",
364
+ " <thead>\n",
365
+ " <tr style=\"text-align: right;\">\n",
366
+ " <th></th>\n",
367
+ " <th>Question</th>\n",
368
+ " <th>Correct Answer</th>\n",
369
+ " <th>User Answer</th>\n",
370
+ " <th>Evaluation</th>\n",
371
+ " <th>Score</th>\n",
372
+ " </tr>\n",
373
+ " </thead>\n",
374
+ " <tbody>\n",
375
+ " <tr>\n",
376
+ " <th>0</th>\n",
377
+ " <td>Question 1:\\nWhich of the following statements...</td>\n",
378
+ " <td>C</td>\n",
379
+ " <td>C</td>\n",
380
+ " <td>correct.</td>\n",
381
+ " <td>1</td>\n",
382
+ " </tr>\n",
383
+ " <tr>\n",
384
+ " <th>1</th>\n",
385
+ " <td>Question 2 (Revised):\\nWhich of the following ...</td>\n",
386
+ " <td>D</td>\n",
387
+ " <td>D</td>\n",
388
+ " <td>incorrect. the correct answer is d, software d...</td>\n",
389
+ " <td>1</td>\n",
390
+ " </tr>\n",
391
+ " <tr>\n",
392
+ " <th>2</th>\n",
393
+ " <td>Question 3:\\nWhat is the purpose of capitalizi...</td>\n",
394
+ " <td>C</td>\n",
395
+ " <td>C</td>\n",
396
+ " <td>incorrect. the correct answer is b.</td>\n",
397
+ " <td>1</td>\n",
398
+ " </tr>\n",
399
+ " <tr>\n",
400
+ " <th>3</th>\n",
401
+ " <td>Question 4:\\nWhich financial statement provide...</td>\n",
402
+ " <td>C</td>\n",
403
+ " <td>C</td>\n",
404
+ " <td>correct</td>\n",
405
+ " <td>2</td>\n",
406
+ " </tr>\n",
407
+ " <tr>\n",
408
+ " <th>4</th>\n",
409
+ " <td>Question 5:\\nWhat is the purpose of the matchi...</td>\n",
410
+ " <td>B</td>\n",
411
+ " <td>B</td>\n",
412
+ " <td>correct</td>\n",
413
+ " <td>3</td>\n",
414
+ " </tr>\n",
415
+ " </tbody>\n",
416
+ "</table>\n",
417
+ "</div>\n",
418
+ " <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5123f950-1dca-46a6-be4d-dab5de1f8899')\"\n",
419
+ " title=\"Convert this dataframe to an interactive table.\"\n",
420
+ " style=\"display:none;\">\n",
421
+ " \n",
422
+ " <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
423
+ " width=\"24px\">\n",
424
+ " <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
425
+ " <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
426
+ " </svg>\n",
427
+ " </button>\n",
428
+ " \n",
429
+ " <style>\n",
430
+ " .colab-df-container {\n",
431
+ " display:flex;\n",
432
+ " flex-wrap:wrap;\n",
433
+ " gap: 12px;\n",
434
+ " }\n",
435
+ "\n",
436
+ " .colab-df-convert {\n",
437
+ " background-color: #E8F0FE;\n",
438
+ " border: none;\n",
439
+ " border-radius: 50%;\n",
440
+ " cursor: pointer;\n",
441
+ " display: none;\n",
442
+ " fill: #1967D2;\n",
443
+ " height: 32px;\n",
444
+ " padding: 0 0 0 0;\n",
445
+ " width: 32px;\n",
446
+ " }\n",
447
+ "\n",
448
+ " .colab-df-convert:hover {\n",
449
+ " background-color: #E2EBFA;\n",
450
+ " box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
451
+ " fill: #174EA6;\n",
452
+ " }\n",
453
+ "\n",
454
+ " [theme=dark] .colab-df-convert {\n",
455
+ " background-color: #3B4455;\n",
456
+ " fill: #D2E3FC;\n",
457
+ " }\n",
458
+ "\n",
459
+ " [theme=dark] .colab-df-convert:hover {\n",
460
+ " background-color: #434B5C;\n",
461
+ " box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
462
+ " filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
463
+ " fill: #FFFFFF;\n",
464
+ " }\n",
465
+ " </style>\n",
466
+ "\n",
467
+ " <script>\n",
468
+ " const buttonEl =\n",
469
+ " document.querySelector('#df-5123f950-1dca-46a6-be4d-dab5de1f8899 button.colab-df-convert');\n",
470
+ " buttonEl.style.display =\n",
471
+ " google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
472
+ "\n",
473
+ " async function convertToInteractive(key) {\n",
474
+ " const element = document.querySelector('#df-5123f950-1dca-46a6-be4d-dab5de1f8899');\n",
475
+ " const dataTable =\n",
476
+ " await google.colab.kernel.invokeFunction('convertToInteractive',\n",
477
+ " [key], {});\n",
478
+ " if (!dataTable) return;\n",
479
+ "\n",
480
+ " const docLinkHtml = 'Like what you see? Visit the ' +\n",
481
+ " '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
482
+ " + ' to learn more about interactive tables.';\n",
483
+ " element.innerHTML = '';\n",
484
+ " dataTable['output_type'] = 'display_data';\n",
485
+ " await google.colab.output.renderOutput(dataTable, element);\n",
486
+ " const docLink = document.createElement('div');\n",
487
+ " docLink.innerHTML = docLinkHtml;\n",
488
+ " element.appendChild(docLink);\n",
489
+ " }\n",
490
+ " </script>\n",
491
+ " </div>\n",
492
+ " </div>\n",
493
+ " "
494
+ ],
495
+ "text/plain": [
496
+ " Question Correct Answer \\\n",
497
+ "0 Question 1:\\nWhich of the following statements... C \n",
498
+ "1 Question 2 (Revised):\\nWhich of the following ... D \n",
499
+ "2 Question 3:\\nWhat is the purpose of capitalizi... C \n",
500
+ "3 Question 4:\\nWhich financial statement provide... C \n",
501
+ "4 Question 5:\\nWhat is the purpose of the matchi... B \n",
502
+ "\n",
503
+ " User Answer Evaluation Score \n",
504
+ "0 C correct. 1 \n",
505
+ "1 D incorrect. the correct answer is d, software d... 1 \n",
506
+ "2 C incorrect. the correct answer is b. 1 \n",
507
+ "3 C correct 2 \n",
508
+ "4 B correct 3 "
509
+ ]
510
+ },
511
+ "execution_count": 8,
512
+ "metadata": {},
513
+ "output_type": "execute_result"
514
+ }
515
+ ],
516
+ "source": [
517
+ "# Initialize necessary variables\n",
518
+ "prompt = \"\"\n",
519
+ "question = \"\"\n",
520
+ "correct_answer = \"\"\n",
521
+ "user_answer = \"\"\n",
522
+ "\n",
523
+ "# Initialize score\n",
524
+ "score = 0\n",
525
+ "\n",
526
+ "# Initialize an empty list to hold row data\n",
527
+ "row_data = []\n",
528
+ "\n",
529
+ "for index, row in df.iterrows():\n",
530
+ " author = row['author']\n",
531
+ " message = row['message']\n",
532
+ "\n",
533
+ " # Choose the appropriate prompt based on the author\n",
534
+ " if author == 'assistant':\n",
535
+ " if 'Question' in message:\n",
536
+ " question = message\n",
537
+ " user_answer = '' # Reset user_answer after a new question\n",
538
+ " elif 'Correct! Option' in message:\n",
539
+ " correct_answer = message.split('Option ')[1][0]\n",
540
+ " if user_answer: # If user_answer exists, make the API call\n",
541
+ " prompt = f\"Given the following question:\\n{question}\\nThe student responded with: {user_answer}\\nIs the student's response correct or incorrect?\"\n",
542
+ "\n",
543
+ " # Make an API call to OpenAI\n",
544
+ " api_response = openai.Completion.create(\n",
545
+ " engine='text-davinci-003',\n",
546
+ " prompt=prompt,\n",
547
+ " max_tokens=100,\n",
548
+ " temperature=0.7,\n",
549
+ " n=1,\n",
550
+ " stop=None\n",
551
+ " )\n",
552
+ "\n",
553
+ " # Extract and evaluate the generated response\n",
554
+ " generated_response = api_response.choices[0].text.strip().lower()\n",
555
+ "\n",
556
+ " # Update score based on generated_response\n",
557
+ " if 'correct' in generated_response and 'incorrect' not in generated_response:\n",
558
+ " score += 1\n",
559
+ "\n",
560
+ " # Create a dictionary for the current row\n",
561
+ " row_dict = {\n",
562
+ " 'Question': question,\n",
563
+ " 'Correct Answer': correct_answer,\n",
564
+ " 'User Answer': user_answer,\n",
565
+ " 'Evaluation': generated_response,\n",
566
+ " 'Score': score\n",
567
+ " }\n",
568
+ " # Append the row dictionary to row_data\n",
569
+ " row_data.append(row_dict)\n",
570
+ "\n",
571
+ " elif author == 'user':\n",
572
+ " user_answer = message\n",
573
+ "\n",
574
+ "# Create a DataFrame from row_data\n",
575
+ "output_df = pd.DataFrame(row_data)\n",
576
+ "output_df\n"
577
+ ]
578
+ }
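+ ,
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Optionally, the graded table can be written to disk for record-keeping. This is a minimal sketch; the filename below is an arbitrary placeholder."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Save the graded results to a CSV file (hypothetical filename)\n",
+ "output_df.to_csv('graded_results.csv', index=False)"
+ ]
+ }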
579
+ ],
580
+ "metadata": {
581
+ "colab": {
582
+ "authorship_tag": "ABX9TyOn+FniXzrkHNKH5uAKgyUD",
583
+ "include_colab_link": true,
584
+ "provenance": []
585
+ },
586
+ "kernelspec": {
587
+ "display_name": "Python 3 (ipykernel)",
588
+ "language": "python",
589
+ "name": "python3"
590
+ },
591
+ "language_info": {
592
+ "codemirror_mode": {
593
+ "name": "ipython",
594
+ "version": 3
595
+ },
596
+ "file_extension": ".py",
597
+ "mimetype": "text/x-python",
598
+ "name": "python",
599
+ "nbconvert_exporter": "python",
600
+ "pygments_lexer": "ipython3",
601
+ "version": "3.8.16"
602
+ }
603
+ },
604
+ "nbformat": 4,
605
+ "nbformat_minor": 4
606
+ }
instructor_intr_notebook.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
instructor_vector_store_creator.ipynb ADDED
@@ -0,0 +1,333 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "<a href=\"https://colab.research.google.com/github/vanderbilt-data-science/lo-achievement/blob/main/instructor_vector_store_creator.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "markdown",
12
+ "metadata": {},
13
+ "source": [
14
+ "# Creating a Shared Vector Store (for Instructors)\n",
15
+ "\n",
16
+ "This notebook is for instructors to create a *vector store* which contains all of the information necessary for students to generate their own self-study materials using large language models. It is expected that instructors who will use this notebook know how to run and interact with a Jupyter Notebook, specifically on Google Colab.\n",
17
+ "\n",
18
+ ":::{.callout-info}\n",
19
+ "On Colab, there may be a pop-up saying 'Warning: This notebook was not authored by Google'. In that case, click 'Run anyways'. If you started this notebook from the Vanderbilt Data Science github, then you can trust the code in this notebook.\n",
20
+ ":::"
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "markdown",
25
+ "metadata": {},
26
+ "source": [
27
+ "## Setting Up API Access\n",
28
+ "Much of the following code rely on certain *APIs* (application programming interfaces) which have limited access. You will need to get an *API key* for each of those services which will be inserted into the code to let the service know you are an authorized user."
29
+ ]
30
+ },
31
+ {
32
+ "cell_type": "markdown",
33
+ "metadata": {},
34
+ "source": [
35
+ "#### OpenAI"
36
+ ]
37
+ },
38
+ {
39
+ "cell_type": "markdown",
40
+ "metadata": {},
41
+ "source": [
42
+ "First, you will need an **OpenAI API key**. To do this:\n",
43
+ "1. Visit [platform.openai.com/account/api-keys](https://platform.openai.com/account/api-keys) and sign up for an account.\n",
44
+ "2. Click 'Create a secret API key', and give it any name you want.\n",
45
+ "3. Copy the newly created key, either by right-clicking and pressing 'Copy' or using the keyboard shortcut -- Ctrl+C on Windows, Cmd+C on a Mac.\n",
46
+ "\n",
47
+ "Run the following code cell. You'll see a blank text box pop up -- paste your API key there (using the shortcut Ctrl+V on Windows, or Cmd+V if you are using a Mac) and press Enter."
48
+ ]
49
+ },
50
+ {
51
+ "cell_type": "code",
52
+ "execution_count": null,
53
+ "metadata": {},
54
+ "outputs": [],
55
+ "source": [
56
+ "OPENAI_API_KEY = getpass(\"OpenAI API key: \")"
57
+ ]
58
+ },
59
+ {
60
+ "cell_type": "markdown",
61
+ "metadata": {},
62
+ "source": [
63
+ "#### DeepLake\n",
64
+ "\n",
65
+ "Next, you will need to input a **DeepLake API key**, found in the DeepLake dashboard at [app.activeloop.ai](https://app.activeloop.ai).\n",
66
+ "\n",
67
+ "1. Click the link above and create an account.\n",
68
+ "2. After making an account, you will be prompted to set a username. Once you have set your username, copy it, run the code below, paste the username into the text box, and press Enter. (This username will be shared with students.)"
69
+ ]
70
+ },
71
+ {
72
+ "cell_type": "code",
73
+ "execution_count": null,
74
+ "metadata": {},
75
+ "outputs": [],
76
+ "source": [
77
+ "DEEPLAKE_USERNAME = input(\"DeepLake username: \")"
78
+ ]
79
+ },
80
+ {
81
+ "cell_type": "markdown",
82
+ "metadata": {},
83
+ "source": [
84
+ "3. You should then be on the DeepLake dashboard. At the top, click 'Create API token'. You should see an empty table with the columns 'Name', 'Expiration date', and 'Token'.\n",
85
+ "4. Click the 'Create API token' button ![image.png](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAJYAAAAwCAYAAADzRIMRAAALoElEQVR4Xu1caXBT1xX+nvZdssFm8+DYgGPWAVMggFkyMWuJyYBDgRICDZ1JO6HJkNKkAYakEHcgSxsyQybDUkqLW9bWCVsh7GFJ2NywhuASJyQ4GCxZaF973nuWbFmSbWHJjjXv/tLy3r3nnvPdc7577nmP8VOD0AQNxFkDjACsOGtU6I7TgAAsAQgJ0YAArISoVehUAJaAgYRoQABWQtQqdBozsLhNpM8J193P4a78FO6q8/CZr8NruQp4BIW2Ww1IALGmD0S6XEjTBkPaOR+y9KGASA6GYWKeVvOB5ffCY70D+5cb4aooBSQqyDuPhTj9J5AaciFSZ0AkVccsgHDDj0MDPrcVPuttuE3kJO6eg7PyKDkKG2SZU6F89BeQqLvQVk/cbGGbBpbfB5/bDNul92AjUCmynoay12xIO+Y1exDhwvapAfe9C7B/VQLHre1QEbhU/V8k56EjgImanFCjwPJ7HXDeOQZr2VuQ6nO4jiWGR5vsVLgguTTgMX3JORZ3zQ2oBy6GvMsYcl6KRicZFVh+coO2LzfBdnE5NMPfh7LHzOTSljCbmDVgL/8nLKcXQjXoDfJg88AQHYrWIgKLA9XVD+Ao3wbdqLUU9gbHLIRwQ3JqwH3vPMwnfg1FjxlQ9flVVHCFAcvvdcJ2fQMcN/4K/eObhdCXnPho0azY0FhzZC4UOc9ClfschUV5WH+hwCKi7vjuEB4cnQ3DxP2Cp2qR+pP7ZtZzmfZPhHZsCRTdnggj9CHA8rlMMB4ogqr3AoFTJTcu4jI7lnPZrq1HyvgdEMkMIX3WAYvyVJYLK+Gz/cDxKqEJGmiOBli+JVJ1giZvaUieKwgsj+U2qj8aidTJBwRe1RyNCtdwGmD5VvXe8UgtPAmJJiOoFQ5Y7DGN5cIK+F1m6Ia/LahM0EBMGjCf/i0YmY681rLg8Q8PLEqEVpeOhG70OiGjHpNKhYtZDbAZevPxXyJ16slg4pQDlvPOcVjOLkGHwhNx1ZTH/hmc5VfA1j4zkmzKfdDZojSuQySoMyNsN3bB59JCnjUDkY5AHd9sgMdc/3A2HfLMUZBo9WB/9aMctsvHwGjHQpWZHRc52TG9jkFQ5uSh6UOVuAzZ7E7ufzQKmiFvUlZ+NHcPByzLxWKCnR2aoSua3VFjF/rxA8yHhsF5qyL0MlEuVPnHoclJi8s4DTuxX3+JzrW6QzNpEVqCX4/pDzDuWM4tCGnPM0gZOyxMXtNeBq7vw6ch6VYC/aRZBK7NqF7/LJguJejw01lhFz6MrOyYbvNSpMxcASpG+FE1y+fLSFlKaAa9Vgcs44HplOyaC+UjU+MgrAvmoyo4bnoh6bYe2sfnQqqQ0jnTFpj3zIHXVQTDrO2QhefUWjy25VQmnRgMgWHBDsha0Jv5mJhOHZ6AuNNBeKuegeGZzZA1ONjngTUZ2qfXckb2oxLWY4/BfRdQj/VD2bNxYD2MrD9mYNm/LqWk+mZKPeysA9b9HX2hL9gVl92gx1wM47YlYDp+gJSnnkd9e/hst+CXZnHh0HO/GKZ9f4as71p4btGpeef9MIzoD49lL8yHX4Dn3i0qEMqBrMdaaPPJyJy4NlgvvQD75b9QiQd9lQ2CIm8rtP16wXpuJP1+Cn62JkxBYSnnGPRDc6ky4xIeHJ4N153LZHw1JJ3XQjduLiRRKkB8vl0wbp4OpJVA03sPag5vgXKEG9o+oT6CB9b0EBC7KxfBtPtPkOZ+AV3+xageyxJFVq/tIMxHXoan6hIoV00lScW0MH8PWW01UiiwXLCcyYf9phOqEaegzmZIN/PpbHcbpwORZjY0BZugSOV9d80hhkL3ciqDOQ/Hpd2g7BKk3UuhKygMsdHDrkcuG//JNHQoulIHrLubGHT8uSUu9VTO60+i5tPdUD5GxugX3WG7q16BqXQ1/KJU8gzjIc9eDWXWSRi3zoJf+xLUg2dRAeGHsJ/dyBmKA92Dd2DauQ6S3r+j+iALXF+9SADNgGbit2AsL8H6xXt0TwYkGVSklrka6t73CeSD4PX8DKqhL9OO5d+wnSom0G9HyqSiiDzFeXMmao5uhXq0G6qcz1C9JR9+zToipgs4hQW4TTiwbERg+9CqreCAqO5TEhVYbBhsKKsqpxLGnUOpYDKXFttiiJiTcF7bSIuEQl8RhT5aCHXAWgbHuVGwlZ0lPd+Gpl9XWM/0Im/thbzvGsi7+2E/Xwj3/ZnQz/gH5Er+XjZ0izstgSwrHW5Od52gHlcJdebDwqnuPrae694WDdLn8U8Tchzr7noG6Qvi83ih9VweldlcJOX6aZVHFzgALEU9AHL3ftEJ+ln7OGWwzXpxOKz/7Y2UeQQw+u5zV8Frs5GFDfA7ilFD4FTUjtUwvLj+Nwemw8c44KlqUyzOiudgPniNQtgpKPQN5TPC9FEqXNXzg+NxYfGrPOimn+UuVqTw90TjWIz+daRMWw6ROLZQ6LwxDTXH/8UBWp3DL8gAyAMeMwAs5cAy2GjxynLPQ5+fR574Yxg3FUL0SB0f9PkOwFgyAZKcb8lzZ9QDJc/PAt41oLuWQwuoj6O4AyugIPkAfkLRWhBY9QAYzVhAb2ievAqp6A2YPn6dvJyadplu6lpK4LIGQdwQWPz3byKKEAn4AdIu7r4N2mFUlkvNW/MuHhxYAwmReLYFiDwvK+8duRUqzSYvSUWQWQMJVLQAmiDvkWUdCMP80iCfC/QhYT12fn8ezMThqHMw8oUwzFwDKY0V9P4RZiqu3Tw05GeR9N9ScIUDK46hMMBRvMw06It2Ql6vWtlZ8S48XgpL2d2IR/ChsP6KefBpV9ivD4B2+oeQhOynJVSP3Q2WwwwdkteR6YarrqGxHFcmwnz6GlRjTkCeHuqRRarMsNRHY0BkFAs5vRvmkDGDHiuUY9U3TKzA4mUtCwlNAePL80zQ5elrvSQDaa9n4CnfDFGnEhhox8n4NqB64wKIs3dDM7hfCD4YaRpVkavCPVYE/bcEWBFDYTzJOyucjXiWhVw15I9D0XcRZF38xD3egruc8mRqnjP4q8OB5a5aSh7pTTAd1kA/4XniFSbKBa0kYv82lF2ktNskYH1dCP3TpRxg+bBG5dK1Xs9eNgYPzlVBNW475Kk5xBl38pyNQKGbsAoyrRP2r1fRsyCLoOkTmvLw4yTHp3x4Hropr4bwL+fVKbBduczpPRCqIpH3WIDVmKzaJ4gPoowW0lPw2kN5UiDd4KrVMR8Os1Czn0L49wOgGl0KVc+u8FbvowNiKzQjZ3PkPNEeKyJ5j2+6gVevo+JVWE6QER116hanLYG2YCW3y4nmiu035xMoN/G7O7ZJ+tKu5wiX+/La/kYEdy7xLDbhqgSjzobfeCUIrAAw2R2V5JGDSC0ogKvyjzB/8lqdHKIMyPvTbmhIHpfIDLQAn5HVhp36IPE63ofx77/hf0
rjibw5wq4wFmBFlfUgyeqs7YkWprbgP9yiYlsoOIgP1oJJM+EMFJ0vo2bPZEp33AuKIU5bDf2UxQ2Ify3HirPHiphuiHeCNETBzu8oW+yhE/Dw0FP/utDPLtoB3iFSaghmsuv+5/+DtCsklB9r2PxE7qkAFmJ9WojX8dgr6CxUFfZ7dBkS/09kWRube9My8X3aYtR30/02dUXEBGmijnSaEkb4P3k0EPFIRziETh4Dt8VMoh5CC2UzbWGO5BkzatkMO0Wh0C95DN2aM2m00I8TRChNbk17JM1YTZYmszMVHqZIGnu3ykSa9zAF57WEx79axSJJMEhMj3/xEVF4YDUJ7J7QKcT+wGqtOMIj9gm1S7vu/KEfsQ/MWngpSLu2f0KEb/FLQYLgorDIZuWtZauE1xglxFTto9PQ1xi9wj0wEel9DfVnI7x4rX3Ytk2kTNiL10JmI7wqsk2M21qDtv6rIhvMTHi5bWuZupXHabOX27byPIXh2rcGmuZY7Xt+gvRtpAEBWG2k+GQfVgBWslu4jeb3f6EJmcthjHP3AAAAAElFTkSuQmCC) at the right of the page, choose a name for the token, then click 'Create API token'. (You do not need to change the expiration date.)\n",
86
+ "5. Afterwards, you should see the table look something like this:"
87
+ ]
88
+ },
89
+ {
90
+ "cell_type": "markdown",
91
+ "metadata": {},
92
+ "source": [
93
+ "![image.png](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAsMAAABrCAYAAACSXI/oAAAgAElEQVR4Xu3dCVhN6QMG8DctSiZC9p0xzFgG2ZUlFMLIkkGWwYx9+BvZMxhjzSTTDJM9JoPIGMvIln2isa9DlsgMkZSi7f7POffe3K5KqZN7u+99nnlmps7yfb/vfPe85zvfORkphA/4oQAFKEABClCAAhSggAEKGDEMG2Crs8oUoAAFKEABClCAApIAwzAPBApQgAIUoAAFKEABgxVgGDbYpmfFKUABClCAAhSgAAUYhnkMUIACFKAABShAAQoYrADDsME2PStOAQpQgAIUoAAFKMAwzGOAAhSgAAUoQAEKUMBgBRiGDbbpWXEKUIACFKAABShAAYZhHgMUoAAFKEABClCAAgYrwDBssE3PilOAAhSgAAUoQAEKMAzzGKAABShAAQpQgAIUMFgBhmGDbXpWnAIUoAAFKEABClCAYZjHAAUoQAEKUIACFKCAwQowDBts07PiFKAABShAAQpQgAIMwzwGKEABClCAAhSgAAUMVoBh2GCbnhWnAAUoQAEKUIACFGAY5jFAAQpQgAIUoAAFKGCwAgzDBtv0rDgFKEABClCAAhSgAMMwjwEKUIACFKAABShAAYMVYBg22KZnxSlAAQpQgAIUoAAFGIZ5DFCAAhSgAAUoQAEKGKyALGH41vbJWHKxFeZ4dEQxDdpXIV6Y+FdLeI+xNVhwVlz3BcTjd96uyDQFNUIZuHw7Bx3Lv3v5Q9eNwuHiczDBWbNXZH97ydG7MX/GQ3RfPAQfm2V//ayuocAJLB16Dm1WjkS9rK7E5ShgAAJJD7Zgxsw/8TidujYfthJfNEkfgX3KAA4OVlEvBWQJw6fnNsbna56h7oAA/OZRB6Yqmritrmi9wxUhfi56icVCG4aAePwOP/8lAj0dNSpsjCKly8JKfTC/A8W7hmEFLsDDfhRK+hzFmDpAXoXhFPwKt6r7MCBsLTQl0qu6aDY3eT0CPWq+gwxXoYB+CSgSn+HhvzFIFop9wccFPimLsGJMNakSlsUroFih9OuTnT6lXyIsLQX0W0C2MPxFSA18fCcCLdYGYWx9Y0lJOwyLXyhnD+zElUgjWFWyRwe7ijBXeb54cg+vLMog9uLvOPJPAdRq2wUNbGKl5a9GlkQTF0dU1/jCibkbjH1Hw5FYvB46tauTo9Ci303K0udUQAx2o65NTveiTREXiQfPTFGqbBHVRV4CoiL+g6JoBVibCifIx0Yobn0f+wPOIrpQZdg7tUR51XH6KjoCUbBB6SKmUP+35bMT+DvqU7T6tAjS7w/C9h8exuyes1B81mYMbVRSWD8RkeHxKFyhuKq/JODJhf3Yc+EpTLSOf+mkLZSphHUUTuw9igcv3uw7ml6vos9h/85LQtlroE3Xm/imhmYYfr2fAhp1E+uyf143KRD8Mq5+qk1K/B0cUe2zQZe2qCXUmx8K5DeB9C4E1cf+/ReFUb2lE5pWtpCq/WYYjsPj8CdItiotfS8Ar/tYRudExc0gqa8Xqd4OTk1Lpw425TdX1ocCeSkgWxgWw0RAt83oNssG3qHL0FK4nasZhhUIwbTWo3D5077oZZuCgytW4b8WqxA4v5nUuf0GlsWSqw3QsEtXfHjXF2tDyqLex0CFxp2Bw944bDYJB7a6wUr48ji5oD0mBDWE66AGMLmyGb+GNIXv7qmy3kLOy0bivvJWILMwnJIYipmOQ/F8+EEs7W2N8K190OvHBlh/2B2VLnvAwTUIZpVqoH3X5nh5fB0CL9bFokPL0aEooHnSlPZxoCgKm5ZCuy8XYkqP8Az6QzkETJ+AnwOvwqx2M7TpNRuTexzSGLEVjv+5bTBsZyX0G9wZBf72EY7/TzBvny862QCvtMt0eBX+iBqIHXtHQXvGR+z52ejWdwcKt+8Pl4q3seXkFUSHVsD0sDXSyPCx2c3gfqwZXAc0wMujP2HThbbwDf4O1rsmY7L3TvyjqIvmLfrAY243FBa29dmwo6jvOgh1Cl9EwKq/4bBiT+qFcd62KPdGAfkEtMNwUqQ/BjjMQ0ybwehV6z9s/WUnzHtswsapdWCc5m6Lsu9OvjYWm1e7oqTpI2wc6og1sb3g6lwRUUIf2xv3DQKEO6nixCrxnOgT1gBlGzqgm9A/160LRvOFZ/GdYwH5KsctU8BABGQNwyF+ztgxug4WR8/DDqFDm2tNk0hMTISpqXK0SDxpO31ljhXHpqKGquOHdLiPZf0KQIFgjG/4AxrvDURf1QlevWyVu/Pg3DMac4/Ph61q/uSVJa0wO3kDNk2sYCDNyGrmpoByms/TNJs0UpTD2MBgaZpCUuQ69O8UhJ7r7bGx7wkM+3NtavB0HGaMn0/MRC1p7QRsHfExNpT5Q5o+oB2GpzzyxJ6ldqkjOxn1hw+1pkloji61FY7/Tp3CMfXsT2gjHf8J2D+pEebGe+OQdyupX2mWKePbtFFY7lofh22Ppvab5PiVcP3kOIapwrBC6K9JQn8Ve6z21A3NuikQhnmOrlBMP4Vpdsq7QolCObsONYdv0Pg3Qnhuth23RYG8FkgbhpV9fmvVQ6n9KOnJz+jZfC96HNiBfuWVU4/cwnxRZG5ruJ8bBr+Ng1BZ6LvPd38Bp5X2+GPbINWzNuK2miC0UyjmdSkghWH1OVGsI6cm5XVLc3/5WUDmMOwCcSRtqoMbYr86joUFh6fOGVbgEbaN7465wa9gYlEcH9e2xLXLtlivEYb/7hyBH3orT7wz7ZegxRHl3EXN4FxeCNgNFsahZd1Sqe2U8vwqrhacgJOcm5yfj13Z6pbZyLB6p+H+PeE04zrazDqCH/tZp3tBJ/5Q826IdhjWnGObWX/ILAzbpTMP/9X5yegwygq+Ql8SR6s1LzIzCsPa4VYsu/ay93ePQt/pJxFXoBBK1v4EZhevwmHtEekCQbNu4nr9qy3FizZ1hUkhCiWZ0TNcP/oBpl9TjjLzQ4H8IpD2QjDt/H5lHROwun8VXOz6AJ69hb5RdQ+qD7mNgM22+OW0F1qoBnHE7QwLqgbb6papNAkPQvG8mb90MS2GYfU5kWE4vxw9rIeuCMgehsWKxp73gLPrcTj2L4Pt112kuZjxQUNhv7A2tgeNk0aKYs9MROdx1tIJXD0ynNUw3NK/FQK8usBEQ9XIogTKl1DO0+KHAtkReFsYFoPrhoEO+CnMEmaV3bFddRtT++6GuM+I1c7ocewrnFzd5Y2RYc0wnFl/eFsYbhXgghP+rqkjzOK2WnnWx+/CVAjrbIXhEbBeehz/q6/U0gzD7bELw2t7o/HOPzG0ivC7xFOY1NodFZdnFIYD0WG/Jxw0OyUKwSZ1nnN2WoTLUkB3Bd4Mw2n7kQL3sLizI6LHXsVsRyEMV5+BO81Ho1WkD6433Z76kLm4nelP5mLl+LQPoZqq5hMzDOvuMcCS6b9AnoRhkenqMgd0XxYOqybzpTAcF+gG+/UOOCjcErISfq85mpWdMFwt2h9u9r5oqvGgXvzNq/ivYi3p1
hM/FMiuwNvCsDgqPGh3X2zxs0Ngv044322fNH9YDMPtet/GxMN+6CpM51HOL+6Dx25/Y/ngDzINw5n1h8zCsEP0L+jd5De09t8nzccVg/qa/q1x0PZPbBhX4Y3pR5k9zb5rfA3Mj5wvTWkS5yg+Ey5Qu/SOlOYMt8dmDP5wC5xDtqCXMBCugHDHp8V4lE0nDEN4TNBvYHNsr7Y59USvSLyBG3eq4qMP06Tj7DYNl6eAzgloT1c4Mq0ept2bmdqPYoU7NZ36v8D3wrMzzc1ev6HFQTW32GZacOr3R/vP72HyAeW0K/Hz39WrMK1VK3XOMEeGda75WaB8IpBnYVg5mtYKy1LmSmFYDAqePYZgy6s6qFuhIBJe3hFOlm2zPU1CDM4Pgr/B4DH7kFyjCUrFX8DN2EaY/euPcCqXT1qJ1chTgczmDA9O8UCXwfcwUTVPWHxYZmAHXzRZFYSvzGahw5dhqFUuDGfCEpAcH4siTb7HttXKcJnZNInM+0OCNPd+8hErVO24GDsXPEjzyjPx+B8wcg+iLCxglJCAUm0WYsNSJ2mf2qPVmYXh5MgDmNR3OHb9WxSWZgqUammLAr+/wmhpznACTi3rglFrjPBxg/IomPgSdy7dQzfVNInnwmh0yzEhKFTcGV7Hv0cjYVuT3cbiQPzHqF/+BS5fjkNrj02Y3710nrYld0YBuQW0w7B44ffzoJ746WJBmJslIcmoKsb8shlDhItV7f6nvmvqIl3MJuPS6gEY7BWGMnXrweT+aTwtMQBL1/0P9YSZExwZlrsluX1DFpAlDGcHVHyFWoxJGdVrZbKzpvayyldcvbJUv6ImJ9viuhTIvoBm8Cz3jsd1xv0hTut1atrlU76iyTiTd5xmtUaZ9ck3Xy33eqvK1yFWTPOOVfG1a49fWGq8ii6rpeByFNBvgXc/9nOvL+u3IEtPgbwTeO9hOO+qyj1RQF6B9OYMy7tHbp0CFKAABShAgZwKMAznVJDrU0AlkPzsLHYfMUarrnWlefD8UIACFKAABSig+wIMw7rfRiwhBShAAQpQgAIUoIBMAgzDMsFysxSgAAUoQAEKUIACui/AMKz7bcQSUoACFKAABShAAQrIJMAwLBMsN0sBClCAAhSgAAUooPsCDMO630YsIQUoQAEKUIACFKCATAIMwzLBcrMUoAAFKEABClCAArovwDCs+23EElKAAhSgAAUoQAEKyCTAMCwTLDdLAQpQgAIUoAAFKKD7AgzDut9GLCEFKEABClCAAhSggEwCDMMywXKzFKAABShAAQpQgAK6L5BhGH769Knul54lpAAFKEABClCAAhSgQA4EODKcAzyuSgEKUIACFKAABSig3wIMw/rdfiw9BShAAQpQgAIUoEAOBBiGc4DHVSlAAQpQgAIUoAAF9FuAYVi/24+lpwAFKEABClCAAhTIgQDDcA7wuCoFKEABClCAAhSggH4LMAzrd/ux9BSgAAUoQAEKUIACORBgGM4BHlelAAUoQAEKUIACFNBvAYZh/W4/lp4CFKAABShAAQpQIAcCDMM5wOOqFKAABShAAQpQgAL6LcAwrN/tx9JTgAIUoAAFKEABCuRAgGE4B3hclQIUoAAFKEABClBAvwUYhvW7/Vh6ClCAAhSgAAUoQIEcCDAM5wCPq1KAAhSgAAUoQAEK6LcAw7B+tx9LTwEKUIACFKAABSiQAwGG4RzgcVUKUIACFKAABShAAf0WYBjW7/Zj6SlAAQpQgAIUoAAFciDAMJwDPK5KAQpQgAIUoAAFKKDfAgzD+t1+LD0FKEABClCAAhSgQA4EGIZzgMdVKUABClCAAhSgAAX0W4BhWL/bj6WnAAUoQAEKUIACsgnEx8fj5cuX2dq+ubk5LCwssrXO+1yYYfh96nPfFKAABShAAQpQQIcFbt26BfGf7HyqVasG8R99+TAM60tLsZwUoAAFKEABClAgjwXUYbhRo0ZZ2vPp06elIMwwnCWurC4Uh8fhz1CwdFlYmWZ1HS5HAQpQgAIUyH8CN4+ux6m7BdKtWLUW/dGsivJXisRn+OevAwi5Ey/9f+k6XdGqnhV4Gs1/x4QcNRKnRkRERKBs2bLSv8VA3KFDhyztat++fe8UhqOiomBtbZ3hPsQyyTX1Is9Ghl9d9kC7LuGYHrYGjlniBJIi/THE8VuEKixRu+dGbJpaK4trpr+YAhfg0aorYkfdwQ+9lV8WD/9NRrEKxWGeoy1z5fwqcGv7ZCy93xPeY2zfWkVx2d9SvsbUHmXeWDY58gA85xxCnYnfoWN55a8jghfAc19NTJzbDaUz2boi8QGOrvKGV8B+hEcVw0ftBuJbj/6oXuj1SvEP/4TXpAUIuPIcJWv1w5QFY2BX1kS1QALu/eWLZYu34dDtaJSo7Ighc2agVy3lUZ8SfxlrZk/Hr0HhiDEvjSaf/Q/Tv26LUjxrvrXNuQAF8lpg67g6mH+soLDbZMQ/f44U82KwNFNIxejkcQazuwKx5z3R94vluGX8ERrULY2CRlG4cfIykquPhu/m0fjYLK9Lzf3pm4AYTMURXnE0+OnTp7KHYfX+xPBdu3btN7hiYmJw5swZVKxYUZYRZ50Ow2cXNMeXp0fgwFY3WOXCkaQdht8loOdCMbgJPRI4PbcxRl2bjBA/l7eWWlx2bvJ6BHrUTLNsSvwpTOvshm33y2J0QDDG1AGenZmCbv0D8KzEIGw5NhU1Mtx6DDYOboTVRv/DHPdOKG/5EH/OGoHl9wdjx95REHN1cvQWfGE/B6bD12GmswVC14/H7E3V4R26DC2Fk97jbX3RYaElxi6aiHaVLXD/8DSM/+4Jhu/dhYFVzmKa3QDcajkHU0Y2hHXcBayaOAlHyi9F0E8OHEV6a6tzAQq8HwHt85m6FOL3wWD7qYjp+hv85zRIHehJSQzFTMfeOFxuEXYI32fF3k+xuVc9EcjrMCyyXLp0KXU0WjMQq4Ow+FCeGM5NTNQDPbmHKWMYjsOtY3twUrhFI96eaWayGE7pjAzH3A3GvqPheFWoMuydWqK8NNqVgKiI/3DC2wU+KYuwQkgPNqrRW3E09+yBnbgSaQSrSvboYFcxtbMr4iLx4JkpSpUtknoSfxUdgceJ1ihfwgKaXx5LnCMRdup7uA19iFHBC+FgVRqli3AoLPcOrfyxpbRhOA6R4fEonHonIQHPIiJRwEY5hSe9MKzAv1jeuw3215+JGnt8UNbnKEaU+gW9221FsylNsNvHAisyDcPCHYy4OKF/FHp9Uktai88/2oKOwbswSEjDV5a0wvDrU3BghZPquI/Cctf6ONbsFDaME8ec4xAfVwgWqSPJCVjdvwoO2oZKv09MTISp6etj/9X5yegwygq+bylX/mhh1oIC8gqId4VmjZyCgHPPYFLsU7jN74jbEy6gZ+gPaCPtOgGX/L8SLlZDcD/WCEUrtsSUn39E9w9NEHfTD6NHeOKvu3EoYFkJzfvNwLxv7KUgm1EYFr8PPt/THbuCxkkXy5qfxLvz4NQ2EI7+f8G9sbz15tb1W+B9hOH0AnFeBGFxv7KEYfEKdE7nPvB/
UB71mlWH8cML+Ne4OJ5fKoW5qmkSCjzCttEdMeOEDRo2KA+zl/dw9iwwZMMejKx/HYu694P/jRfCabwwipq3wnzhi6ORdOtnNWKqNEc16zjc/TsEkdYj8dv+8dLIWtxWVzTxqocAjZO4ZpjR/PL43nIE7GYcwfPnybAoWhif9tuEdeOr6/fRy9LnuoDm8ZOCX+FWdR8GhK2VpvpIx5P9KJQUAq442pvRyLBCCJuJplcxR2NZMYCm3JgDp6/M3xqGtSslntA6tbuBCZfXwMksCj496+Ja97tY1u/1PMKI1c7oFuSGE/6ub4zuKhCGeY7t8GjITXj1Tjv3ULzYPDTLCbOfzuXIcK4fTdygoQmI3xGz2/ZAaEMf/DSrA2wSz2G9+wj8EFQTP6jOhfeEvuroVRhT/HzxeT1j3Nk7E95/fY4FM23g07kD7vQ+DK+BpWAcfwfrPH9HE/ex0jSH9MOw8kL39xoH3rhDJdorcEUoTyf801V9oWxoLcL6ZlXgfYVhzUBcsmRJiOWQc0RY7SFLGBavTHv622J1sCeaqkajwv17ovO0D7BI9QXwcGsfOC+tBb8DM1PnL4ULP3PxqouNqjCrHS7EkB1y9hM0baya65gYiGH156Daz6GYape9MCzOGeY0iax2C8NdLjfCsPIklDY4iz8Tj7/shmHxInJNfztsLr0OOxc3hYnWPHh1S6V3Yaj+ndgXuy6pDt/j82GrmjsolqVtt434Txh/ruo0D7/92BUZP8ZguMcDa06B7Ai8PP417L5MxMKzP6GNqq9pXlR3kL4X+iBp4iXM66J1YYp7WNy5Nc7YboDnpOaqu6av955eGFb/7KLD3nTDsLi238CyCKh2MMPfZ6d+XDb/CrzPMCyqnjt3Do8ePZLuWtrZ2ckyNUKz9WQIw8qRqqC6QWk6W9rgmfY2rbpACgRjfMNZ+GjjQYwQpl2+OdKmfBDIx/csom0+Qc8BLRA8vHfqA3HZGRlmGM6/nTg3a6ZLYVh8kG7t0HZY8WQ0Vm8bpRodUo70PBuufChUMwzb/doRh7YN0phvn4Aw/wFwmRePMeu2YUh94zeoXkXfwu7vB2HpY3f8vrpLrszVz8324LYooE8C6Z2TNMNwe+FuU/9qK9AsUPksgfZHnCbh/vWPOPRPFIyL1UCXMd74tl9V6W5PRtMk/IdWwJqiW7BPuFhWf5IjQ3A2qgEafngeU1r2xIsRt9LcSdInU5Y1bwTeZxhWT41QKBRISkqS3miR3kN1uSmR62E4ow6qGYalq2HhrQ5bo2xSn4JVV8oIlui1JBju9m+G4avLHODia4OxPrPRuXI0Tqz0wrKAU2jscUsKAgzDuXlocFuigK6EYfEhvPmuQ/FHwW/w68ZBqKzxNPiO0dWxpuQfaS4+90/6BJ5Gftgzv4GqIeNwdIkLxqwtiskBG9BHmI+Y0SdFnJNc4zC+UE0H4ZFAAQq8m0B80FC0mFkBG0/MhPpdSGnD8C4Mr+2Byqq7mxnvJQ7/hq7GyC98UXfJeXzrkHEYjljXDQ4Li8Hz+Bp0Kqrc4jnvtui3siKmTimIRR7RmBGyCT146+fdGtVA1npfYVh7jvC1a9fSfagut5sh18Ow+DCAOGdpU+mANFem2lMS0rt61a6c5siwQrplZIcbn12C77APpEW1gzfDcG4fHoa3PXEqzqHD5WDfvjRMhJFY789b4GSTo9g0sQLEk1j/alvgfGo7+tqIr/7bhqGOXmi49kimc4ZTj1WNOcPiz9JOk9B+OO+1ffzDAHztMgORrVZg5Xy7N54Cl/pW79sYF7QGPYTXqcWe94Bz71AMEt4WMUh456g4orx5fFcsvNYRP/p/h2ZC2dWflLhTmDFkA5rPWyxcYIrTj4TRY7++6OFdE+tOf4e6hncIsMYUyJLAwzMB+KdIN9hLD7oF4aAw8urcqLhwXnqEv7ZdQmnntqhotBvDbSfD3D0Inv1KCSO6qrsz0yywQDVl8Mi0enC/OhqbNg2TLnLF/r79r5bo9Zk1TuzYjZqdPpNecyhud+lnTfBgwG0sEl5uk9HAk7jchoGt4PmgB5at/1b1isU4bBtlC/c/E1B7QNoL5yxVlgsZnMD7CMMZPSyX0VsmcrNRZAjDgDgfuP3Uxxi8YTcmNC4onYx3Te+DiVuqw0v1BRBzfBwcBl+Eq2oZsVLSl0BoG/R1Vr70Je00iQRhrlN1bLTZLM2VFG8TpcTvxUi7kbBwD1OODAe6wVb4kll6+he0F+Yqi+9endJvJE4Ir5IRX42V/qvVbmPSNT84GqV9oj43kbkt/RFIitwhBM+JOPisCMyTY1DoQ833ckbB/ys7zD5mBkuLAvigSlOUiriAFstzHoYrX5+N9p23wM7nIr7TeBG3+oEXv3tpR3KNFOUwNvXWqvA0+uq+cFt4DUaFCyA+tih6zt8Cj8/Eky8gvqKwl++jNxpBeVKsits7xmPgtAN4alYYBZNiEWfyCb5ZtTndaRT605IsKQXkE1D3y72Vl+GkMJ1IHNyZd3MMAg+7o7zUlzeg8cIbUmh9EPwN3L76HY8KWcFcYYqag9sj0ete6p0XReIN/DykD7xDklC4sDFexn+A7nPF/muE38d3h8e+GBSvXAEmD8ORVPFLLFe9JzijMCxdfAvbXDv6Cyw+9FB6D7FJwlMkGlfEJzXiEBbTA7/tdkcNvmtYvgMkH2w5vTBcrFjWXsgnvpc4u3+BTvyDGqdOncrwYTl1IM7udrPaFLKEYfHK9M9pvTHutwiYWFnBOKkQ2g1sgBCfaI0/upEgnYT7TzqA2MKFpSvml/HFhS+BTaknce05w4l3V6KfyyLcKdEYdWye47GiKizu7kDZscowrH6Lhd+dgrAS3nVlYtkGPRqeQsDjb9INwymJuzG66dc4lPgBSjacir1resIiq3JcLl8LvHhyDzEmZdJ93V5mv3t3lATsHT8E/07wk16X9i4f5R+REd6/8k5/rVH5OsMXZiWk1xDyQwEKZC6giItGjGkR6bWKYt+LSSwKK9UD43HR0TAt8voVn+LrDR+HP4Fx8QooWijtW2nUexFfDXr/SQKKaPVfZb+OQYF3ev2ncr+vLJT9Wjw3X79qjJq1irN5KZCpgGYYjhNe7/nw4cNsiZUpUwblypXL1jriX7mrVKlShg/Lve332dqZ1sKyhOG3de60ZVB1VmOrNO8HzrhSyuWTM/xiyN5JPWcBIif0XJcCrwWSHgdizb66GKR6OIY2FKBA/hTQfkVj/qwla6XvApphOLM/kazv9VSXX9YwnF+QWA8KUIACFKBAbggocAJLh55Dm5UjUS83NshtUEAGAXUYzs6mxWkUtra22VlFZ5ZlGNaZpmBBKEABClCAAhSgwPsXEOfwRkREZKsg4h/HyO7UiGztQMaFGYZlxOWmKUABClCAAhSgAAV0W4BhWLfbh6WjAAUoQAEKUIACFJBRgGFYRlxumgIUoAAFKEABClBAtwUYhnW7fVg6ClCAAhSgAAUoQAEZBRiGZcTlpilAAQpQgAIUoAAFdFuAYVi324elowAFKEABClCAAhSQUYBhWEZ
cbpoCFKAABShAAQpQQLcFGIZ1u31YOgpQgAIUoAAFKEABGQUYhmXE5aYpQAEKUIACFKAABXRbgGFYt9uHpaMABShAAQpQgAIUkFGAYVhGXG6aAhSgAAUoQAEKUEC3BRiGdbt9WDoKUIACFKAABShAARkFGIZlxOWmKUABClCAAhSgAAV0W4BhWLfbh6WjAAUoQAEKUIACFJBRgGFYRlxumgIUoAAFKEABClBAtwUYhnW7fVg6ClCAAhSgAAUoQAEZBRiGZcTlpilAAQpQgAIUoAAFdFuAYVi324elowAFKEABClCAAhSQUYBhWEZcbpoCFKAABShAAQpQQLcFGIZ1u31YOgpQgL34LDMAAACaSURBVAIUoAAFKEABGQUYhmXE5aYpQAEKUIACFKAABXRbgGFYt9uHpaMABShAAQpQgAIUkFGAYVhGXG6aAhSgAAUoQAEKUEC3BRiGdbt9WDoKUIACFKAABShAARkFGIZlxOWmKUABClCAAhSgAAV0W4BhWLfbh6WjAAUoQAEKUIACFJBRgGFYRlxumgIUoAAFKEABClBAtwX+D4/nW2ut3arDAAAAAElFTkSuQmCC)"
94
+ ]
95
+ },
96
+ {
97
+ "cell_type": "markdown",
98
+ "metadata": {},
99
+ "source": [
100
+ "6. Click the two overlaid squares ![image.png](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEEAAAA/CAYAAAC/36X0AAABf0lEQVR4Xu2aPYqFMBhFvwcWKlhoqaWFG9DNuwVXYGGprYUgdjMRMTgyPHRiRpk5AeEJ5u+8e6+R5PWhivzz8gKCCBCUC4AAhCUMUQIQUIJeGGAH7IAdsMP2S4FMuDoT+r6Xqqoe8TlWFIWEYXhoLJcqAQiKORC+geC6rvi+f0iSpg+N4yjTNOlmHmOHNE1lvn6jNE0j87UWICgSQLANYQ6/tTiOI0EQ6Pt9MP5ZO5RlqScdRZHkeQ4EIKAEEeygkgAIQFheCCgBCCiBxRIrRqUBIABhiQKUAASUoF+L2AE7YAfssN03IBPIhPOZEMexJEmyFZK1323bStd1un2rmy9n7GBtxgcaBoKCBIQ7IQzDIHVdHxCr/UeyLPuyRfiux9MnVd5lgv2p2ekBCIorEEwheJ4n81rgjjIfBbpqDWKkhDsmv/a53/gxGQsQTO1gQt+07q1K2B7XMZ2ISf39USGTtk7bwaSzp9YFwk8y4an/psm4UAJKWPSDEoCwKOET7bDhYssL2DsAAAAASUVORK5CYII=) to copy the API key; then run the code below and paste it into the input text box and press Enter."
101
+ ]
102
+ },
103
+ {
104
+ "cell_type": "code",
105
+ "execution_count": null,
106
+ "metadata": {},
107
+ "outputs": [],
108
+ "source": [
109
+ "os.environ['ACTIVELOOP_TOKEN'] = getpass(\"DeepLake API key: \")"
110
+ ]
111
+ },
112
+ {
113
+ "cell_type": "markdown",
114
+ "metadata": {},
115
+ "source": [
116
+ "Finally, pick a name for your dataset. It doesn't matter what this is, but keep in mind that it will be shared with the students."
117
+ ]
118
+ },
119
+ {
120
+ "cell_type": "code",
121
+ "execution_count": null,
122
+ "metadata": {},
123
+ "outputs": [],
124
+ "source": [
125
+ "dataset_name = input(\"Enter a name for your dataset: \")"
126
+ ]
127
+ },
128
+ {
129
+ "cell_type": "markdown",
130
+ "metadata": {},
131
+ "source": [
132
+ "## Processing The Document(s)\n",
133
+ "\n",
134
+ "In this part, you will upload the documents you want the students / model to reference; the embeddings will be created from those documents.\n",
135
+ "\n",
136
+ "**Note: The embeddings of all the documents you share will be publicly available. Do not use this for any documents you want to keep private.**"
137
+ ]
138
+ },
139
+ {
140
+ "cell_type": "markdown",
141
+ "metadata": {},
142
+ "source": [
143
+ "First, upload your documents to Google Colab. To do this:\n",
144
+ "1. Click on the 'file' icon ![image1_3.png](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFQAAABUCAYAAAAcaxDBAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAALEoAACxKAXd6dE0AAAIKSURBVHhe7do/kmFRFMfxM7MAGyCRKZEI60CA0CJkEjKJHZApAevwJxJhAWyABfTM6T7JdDZ1v6/rvfL7VHV1ndfJ7W9f+ql3f338ZYL5Hd8FoqAwBYUpKExBYQoKU1CYgsIUFKagMAWFKShMQWEKClNQmILCFBSmoDAFhSkoTEFhCgpTUJiCwrCDDs/n01arlW23W3s8HnH1/83nc+v1ejEVDxLUYw4GA7ter3ElTZGjIi/56XSKxXTj8fhzpxdR8g713dloNGJiFXGnJgc9HA42HA5j4k0mE6vX6zExyuWyVSqVmFi5D5qVVquVyR/rbW+bjsfj5z/Sy+USVxhvfR/6er1sNpvFxHj7G3vfqff7PaZ0mbyHNptNa7fbMeWLr/d0OsX0Zb1ec+v1oCn2+/1HtVr952uxWMRP88fX9n29/jtQ9FkepqAwBYUpKExBYQoKU1CYgsIUFKagMAWFKShMQWEKClNQmILCFBSmoDAFhSkoTEFhmTxGzvLsUCp/Bv/9/Cr5GDnXp+9+yvl8tlKpFFOa5Je8L6TT6cRUPL52KqZL3qHOd2m/37fb7RZXiqFWq9lms8lfUOdRl8ul7Xa7pDP2P8Hf47vdro1GIzSmw4LKF902wRQUpqAwBYUpKExBYQoKU1CYgsIUFKagMAWFKShMQWEKClNQmILCFBSmoDAFhSkoyuwPHzLZbD4ZgJMAAAAASUVORK5CYII=) at the bottom of the sidebar to the left of these instructions.\n",
145
+ "2. Click on the 'upload file' icon ![image2_1.png](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFcAAABaCAYAAADNVsqyAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAALEoAACxKAXd6dE0AAAKKSURBVHhe7dw7bsJAFIVhJx3UtLARaNkILIKakkXAGhAlNSyCmj3QJjnRjOQgY173eO5kzi+h2Fgok08Tg8zAx9dPlaL0GX4qQsIlJlxiwiUmXGLCJSZcYsIlJlxiwiUmXGLCJSZcYsIlJlxiwiUmXGLCJUZ9m+d4PFb7/b46n8/hHm7j8biaz+dhL30U3MvlUq1Wq85Q63kCppwWUsEi/Les1+uwlzZz3MPhkAw25gXYHBd/mIc8AJvjnk6nsJW+1MDmT2iz2Sxs/W2z2YQt2279vnqpnuSKeJ2bagYXgYtSABeDi7oGLgoXdQlcHC7qCrhIXNQFcLG4iA1cNC5iAhePi1jAwg0xgIVbyxpYuFcB2CrhEhMusewvOb4Te6yaucSES0y4xIRLTLjEssLFmgjcmFeyLMsGF6h4iYQb80qWZVngRth6OQC7x22CjXkHdo3bBhvzDOwW9xHYGIC3223Y85NL3GdgY7vd7vdxnnKH+wpsDI/zBOwK9x3YmCdgN7gWsDEvwC5wLWFjHoCT496D7fV6Yau5tuOpgZPjtq1EB9xisQh7zeF4G3DKle7JcbHiGyu/r4uwo9Eo3NMcjt8CTv2xKRfn3Gvg4XBYLZfLu7CxJmAPn0dzgYsiMGABNRgMwpHHqgN7gEVucBFAANTv98M9zwVgzHgPsMgVLnoVNvbsjGfmDvc/JVxiwiUmXGLCJSZcYlrl2JDVWDVziQmXWGenhZxye1rAhZecsxy/Oe50Og1beWY5fnPcyWSS7ezFuDF+qyhPaLhsmBswxotxW0b5RrwYvmoQS41Sf89YW0DFqcByxsaouKWn17nEhEtMuMSES0y4xIRLTLjEhEtMuMSES6uqvgFS+TQXb05HUQAAAABJRU5ErkJggg==) on the left of the Files toolbar.\n",
146
+ "3. Select all of the files you want to upload, then click 'Open'.\n",
147
+ "4. A warning should pop up. Click 'OK' to continue.\n",
148
+ "5. Wait until the spinning circle in the bottom of the 'Files' section disappears. This means that all of the files have been uploaded."
149
+ ]
150
+ },
151
+ {
152
+ "cell_type": "markdown",
153
+ "metadata": {},
154
+ "source": [
155
+ "### Adding YouTube Videos / Websites\n",
156
+ "If you have any websites or YouTube videos which also contain content which you want to put into your data lake, paste those links one at a time into the text box below, pressing 'Enter' after each one. Once you have entered all the links, press 'Enter' without typing anything to finish execution of the code cell.\n",
157
+ "\n",
158
+ "If you have no URLs to add, just click on the box and press 'Enter' without typing anything."
159
+ ]
160
+ },
161
+ {
162
+ "cell_type": "code",
163
+ "execution_count": null,
164
+ "metadata": {},
165
+ "outputs": [],
166
+ "source": [
167
+ "url_list = []\n",
168
+ "while (url := input(\"Enter a YouTube / website link: \")): url_list.append(url)"
169
+ ]
170
+ },
171
+ {
172
+ "cell_type": "markdown",
173
+ "metadata": {},
174
+ "source": [
175
+ "### Model for embeddings\n",
176
+ "\n",
177
+ "Below, you can choose a different model which will be used to create the embeddings. At the current time, only OpenAI models are supported. If you're not sure, the following setting should suffice."
178
+ ]
179
+ },
180
+ {
181
+ "cell_type": "code",
182
+ "execution_count": null,
183
+ "metadata": {},
184
+ "outputs": [],
185
+ "source": [
186
+ "model_name = 'text-embedding-ada-002'"
187
+ ]
188
+ },
189
+ {
190
+ "cell_type": "markdown",
191
+ "metadata": {},
192
+ "source": [
193
+ "## Embedding & Database Creation\n",
194
+ "\n",
195
+ "Now that you've made all of the relevant settings, click the \"Run\" arrow next to this code block, or select this cell and then click \"Run This Cell and All Below\" or \"Run All Below\". This will automatically execute the rest of the code so that your database can be created from your specifications.\n",
196
+ "\n",
197
+ "You can ignore any warnings that pop up, but if the code stops execution, read the error. If you cannot fix it, please contact the developer."
198
+ ]
199
+ },
200
+ {
201
+ "cell_type": "markdown",
202
+ "metadata": {},
203
+ "source": [
204
+ "### Library download and installation"
205
+ ]
206
+ },
207
+ {
208
+ "cell_type": "code",
209
+ "execution_count": null,
210
+ "metadata": {},
211
+ "outputs": [],
212
+ "source": [
213
+ "# run this code if you're using Google Colab or don't have these packages installed in your computing environment\n",
214
+ "#! pip install git+https://<token>@github.com/vanderbilt-data-science/lo-achievement.git\n",
215
+ "#! pip install deeplake"
216
+ ]
217
+ },
218
+ {
219
+ "cell_type": "code",
220
+ "execution_count": null,
221
+ "metadata": {},
222
+ "outputs": [],
223
+ "source": [
224
+ "# basic libraries\n",
225
+ "import os\n",
226
+ "from getpass import getpass\n",
227
+ "from IPython.display import display, Markdown\n",
228
+ "\n",
229
+ "# libraries from our package\n",
230
+ "from ai_classroom_suite.PromptInteractionBase import *\n",
231
+ "from ai_classroom_suite.MediaVectorStores import *\n",
232
+ "\n",
233
+ "# from langchain\n",
234
+ "import deeplake\n",
235
+ "from langchain.vectorstores import DeepLake\n",
236
+ "from langchain.embeddings import OpenAIEmbeddings"
237
+ ]
238
+ },
239
+ {
240
+ "cell_type": "code",
241
+ "execution_count": null,
242
+ "metadata": {},
243
+ "outputs": [],
244
+ "source": [
245
+ "#setup OpenAI API key\n",
246
+ "os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY\n",
247
+ "openai.api_key = OPENAI_API_KEY"
248
+ ]
249
+ },
250
+ {
251
+ "cell_type": "code",
252
+ "execution_count": null,
253
+ "metadata": {},
254
+ "outputs": [],
255
+ "source": [
256
+ "# get transcripts from youtube URLs\n",
257
+ "yt_docs, yt_save_path = get_website_youtube_text_file(url_list)"
258
+ ]
259
+ },
260
+ {
261
+ "cell_type": "markdown",
262
+ "metadata": {},
263
+ "source": [
264
+ "Now, we'll create the embeddings and the vector store from the transcripts of the YouTube videos. Make sure that all your documents are shown in the output from the previous code cell, then continue execution."
265
+ ]
266
+ },
267
+ {
268
+ "cell_type": "code",
269
+ "execution_count": null,
270
+ "metadata": {},
271
+ "outputs": [],
272
+ "source": [
273
+ "# create document segments\n",
274
+ "doc_segments = rawtext_to_doc_split(yt_docs)"
275
+ ]
276
+ },
277
+ {
278
+ "cell_type": "markdown",
279
+ "metadata": {},
280
+ "source": [
281
+ "Make sure that all of your documents are shown in the output from the previous code cell, then continue execution."
282
+ ]
283
+ },
284
+ {
285
+ "cell_type": "code",
286
+ "execution_count": null,
287
+ "metadata": {},
288
+ "outputs": [],
289
+ "source": [
290
+ "# create embeddings\n",
291
+ "embeddings = OpenAIEmbeddings(model=model_name)\n",
292
+ "\n",
293
+ "### Dataset Creation ###\n",
294
+ "dataset_path = f\"hub://{DEEPLAKE_USERNAME}/{dataset_name}\"\n",
295
+ "db = DeepLake.from_documents(all_document_segments, dataset_path=dataset_path,\n",
296
+ " embedding=embeddings, public=True)"
297
+ ]
298
+ },
299
+ {
300
+ "cell_type": "markdown",
301
+ "metadata": {},
302
+ "source": [
303
+ "## Sharing With Students"
304
+ ]
305
+ },
306
+ {
307
+ "cell_type": "code",
308
+ "execution_count": null,
309
+ "metadata": {},
310
+ "outputs": [],
311
+ "source": [
312
+ "display(Markdown(f'''To let students access the repository, give them the following URL:\n",
313
+ "\n",
314
+ "`{dataset_path}`'''))"
315
+ ]
316
+ },
317
+ {
318
+ "cell_type": "markdown",
319
+ "metadata": {},
320
+ "source": [
321
+ "Distribute the URL above to students. They will copy and paste it into the LLM learning application, which then allows their models to use all of the documents you uploaded as reference sources when responding to or creating questions."
322
+ ]
323
+ }
324
+ ],
325
+ "metadata": {
326
+ "kernelspec": {
327
+ "display_name": "python3",
328
+ "name": "python3"
329
+ }
330
+ },
331
+ "nbformat": 4,
332
+ "nbformat_minor": 0
333
+ }
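For reference, here is a minimal student-side sketch of the counterpart to the "Sharing With Students" cell above: reconnecting to the published DeepLake dataset with the same embedding model. This is not part of the diff; the `hub://` placeholder values and `read_only=True` are assumptions (students receive the real `dataset_path` from the instructor and should not write to the store).

```python
# Minimal sketch: load the instructor's shared DeepLake store (assumes OPENAI_API_KEY is set).
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

dataset_path = "hub://<instructor-username>/<dataset-name>"  # URL distributed by the instructor
embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')

# Connect to the existing store without re-embedding anything.
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings, read_only=True)
docs = db.similarity_search("What topics do these documents cover?", k=3)
```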
nbs/_quarto.yml ADDED
@@ -0,0 +1,20 @@
1
+ project:
2
+ type: website
3
+
4
+ format:
5
+ html:
6
+ theme: cosmo
7
+ css: styles.css
8
+ toc: true
9
+
10
+ website:
11
+ twitter-card: true
12
+ open-graph: true
13
+ repo-actions: [issue]
14
+ navbar:
15
+ background: primary
16
+ search: true
17
+ sidebar:
18
+ style: floating
19
+
20
+ metadata-files: [nbdev.yml, sidebar.yml]
nbs/gradio_application.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
nbs/helper_utilities.ipynb ADDED
@@ -0,0 +1,405 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# helper_utilities.ipynb\n",
8
+ "> Helper functions for when we need to work with files in Google Colab or locally\n",
9
+ "\n",
10
+ "In this notebook, we write some generic code to help us interface more easily with loading in files."
11
+ ]
12
+ },
13
+ {
14
+ "cell_type": "raw",
15
+ "metadata": {},
16
+ "source": [
17
+ "---\n",
18
+ "skip_exec: true\n",
19
+ "---"
20
+ ]
21
+ },
22
+ {
23
+ "cell_type": "code",
24
+ "execution_count": null,
25
+ "metadata": {},
26
+ "outputs": [],
27
+ "source": [
28
+ "#| default_exp IOHelperUtilities"
29
+ ]
30
+ },
31
+ {
32
+ "cell_type": "code",
33
+ "execution_count": null,
34
+ "metadata": {},
35
+ "outputs": [],
36
+ "source": [
37
+ "#| export\n",
38
+ "import ipywidgets as widgets\n",
39
+ "from IPython.display import display, clear_output\n",
40
+ "from functools import partial\n",
41
+ "from ipyfilechooser import FileChooser\n",
42
+ "import os"
43
+ ]
44
+ },
45
+ {
46
+ "cell_type": "code",
47
+ "execution_count": null,
48
+ "metadata": {},
49
+ "outputs": [],
50
+ "source": [
51
+ "#| export\n",
52
+ "def check_is_colab():\n",
53
+ " \"\"\"\n",
54
+ " Check if the current environment is Google Colab.\n",
55
+ " \"\"\"\n",
56
+ " try:\n",
57
+ " import google.colab\n",
58
+ " return True\n",
59
+ " except:\n",
60
+ " return False"
61
+ ]
62
+ },
63
+ {
64
+ "cell_type": "code",
65
+ "execution_count": null,
66
+ "metadata": {},
67
+ "outputs": [],
68
+ "source": [
69
+ "assert not check_is_colab(), 'On this system, we should not be in Colab'"
70
+ ]
71
+ },
72
+ {
73
+ "cell_type": "markdown",
74
+ "metadata": {},
75
+ "source": [
76
+ "## File Choosers\n",
77
+ "Jupyter notebooks, the different IDEs, and ipywidgets currently (as of August 8, 2023) are not playing nice together, and also are misaligned in terms of versions. What works in Jupyter Lab at version 8 somehow doesn't work in Google Colab and changes are needed. Neither the Google Colab version or the Jupyter Lab version work with VSCode.\n",
78
+ "\n",
79
+ "While this is being worked out between VS Code developers and ipywidgets, we've found a mid-term solutions which requires another package. We implement and test this below (thanks, Code Interpreter!)"
80
+ ]
81
+ },
82
+ {
83
+ "cell_type": "code",
84
+ "execution_count": null,
85
+ "metadata": {},
86
+ "outputs": [],
87
+ "source": [
88
+ "#| export\n",
89
+ "class MultiFileChooser:\n",
90
+ " def __init__(self):\n",
91
+ " self.fc = FileChooser('.')\n",
92
+ " self.fc.title = \"Use the following file chooser to add each file individually.\\n You can remove files by clicking the remove button.\"\n",
93
+ " self.fc.use_dir_icons = True\n",
94
+ " self.fc.show_only_dirs = False\n",
95
+ " self.selected_files = []\n",
96
+ " \n",
97
+ " self.fc.register_callback(self.file_selected)\n",
98
+ " \n",
99
+ " self.output = widgets.Output()\n",
100
+ " \n",
101
+ " def file_selected(self, chooser):\n",
102
+ " if self.fc.selected is not None and self.fc.selected not in self.selected_files:\n",
103
+ " self.selected_files.append(self.fc.selected)\n",
104
+ " self.update_display()\n",
105
+ " \n",
106
+ " def update_display(self):\n",
107
+ " with self.output:\n",
108
+ " clear_output()\n",
109
+ " for this_file in self.selected_files:\n",
110
+ " remove_button = widgets.Button(description=\"Remove\", tooltip=\"Remove this file\")\n",
111
+ " remove_button.on_click(partial(self.remove_file, file=this_file))\n",
112
+ " display(widgets.HBox([widgets.Label(value=this_file), remove_button]))\n",
113
+ " \n",
114
+ " def remove_file(self, button, this_file):\n",
115
+ " if this_file in self.selected_files:\n",
116
+ " self.selected_files.remove(this_file)\n",
117
+ " self.update_display()\n",
118
+ " \n",
119
+ " def display(self):\n",
120
+ " display(self.fc, self.output)\n",
121
+ " \n",
122
+ " def get_selected_files(self):\n",
123
+ " return self.selected_files"
124
+ ]
125
+ },
126
+ {
127
+ "cell_type": "markdown",
128
+ "metadata": {},
129
+ "source": [
130
+ "Now we test the file chooser very briefly to ensure that the results are as we desire."
131
+ ]
132
+ },
133
+ {
134
+ "cell_type": "code",
135
+ "execution_count": null,
136
+ "metadata": {},
137
+ "outputs": [
138
+ {
139
+ "data": {
140
+ "application/vnd.jupyter.widget-view+json": {
141
+ "model_id": "07ff01c3633c4addb72bfe758fb9c4e4",
142
+ "version_major": 2,
143
+ "version_minor": 0
144
+ },
145
+ "text/plain": [
146
+ "FileChooser(path='/workspaces/lo-achievement/nbs', filename='', title='Use the following file chooser to add e…"
147
+ ]
148
+ },
149
+ "metadata": {},
150
+ "output_type": "display_data"
151
+ },
152
+ {
153
+ "data": {
154
+ "application/vnd.jupyter.widget-view+json": {
155
+ "model_id": "1aa36e43d7254a2f8669a92e156b726c",
156
+ "version_major": 2,
157
+ "version_minor": 0
158
+ },
159
+ "text/plain": [
160
+ "Output()"
161
+ ]
162
+ },
163
+ "metadata": {},
164
+ "output_type": "display_data"
165
+ }
166
+ ],
167
+ "source": [
168
+ "# Create file chooser and interact\n",
169
+ "mfc = MultiFileChooser()\n",
170
+ "mfc.display()"
171
+ ]
172
+ },
173
+ {
174
+ "cell_type": "code",
175
+ "execution_count": null,
176
+ "metadata": {},
177
+ "outputs": [
178
+ {
179
+ "data": {
180
+ "text/plain": [
181
+ "['/workspaces/lo-achievement/nbs/_quarto.yml',\n",
182
+ " '/workspaces/lo-achievement/nbs/nbdev.yml']"
183
+ ]
184
+ },
185
+ "execution_count": null,
186
+ "metadata": {},
187
+ "output_type": "execute_result"
188
+ }
189
+ ],
190
+ "source": [
191
+ "# get files that were selected.\n",
192
+ "mfc.get_selected_files()"
193
+ ]
194
+ },
195
+ {
196
+ "cell_type": "markdown",
197
+ "metadata": {},
198
+ "source": [
199
+ "## File loading\n",
200
+ "Now, we implement a file chooser that will work across platforms, whether it be Google Colab or local environments."
201
+ ]
202
+ },
203
+ {
204
+ "cell_type": "code",
205
+ "execution_count": null,
206
+ "metadata": {},
207
+ "outputs": [],
208
+ "source": [
209
+ "#| export\n",
210
+ "def setup_drives(upload_set):\n",
211
+ "\n",
212
+ " upload_set = upload_set.lower()\n",
213
+ " uploaded = None\n",
214
+ "\n",
215
+ " # allow them to mount the drive if they chose Google Colab.\n",
216
+ " if upload_set == 'google drive':\n",
217
+ " if check_is_colab():\n",
218
+ " from google.colab import drive\n",
219
+ " drive.mount('/content/drive')\n",
220
+ " else:\n",
221
+ " raise ValueError(\"It looks like you're not on Google Colab. Google Drive mounting is currently only implemented for Google Colab.\")\n",
222
+ "\n",
223
+ " # Everything else means that they'll need to use a file chooser (including Google Drive)\n",
224
+ " if check_is_colab():\n",
225
+ " from google.colab import files\n",
226
+ " uploaded = files.upload()\n",
227
+ " else:\n",
228
+ " # Create file chooser and interact\n",
229
+ " mfc = MultiFileChooser()\n",
230
+ " mfc.display()\n",
231
+ " uploaded = mfc.get_selected_files()\n",
232
+ " \n",
233
+ " return uploaded"
234
+ ]
235
+ },
236
+ {
237
+ "cell_type": "code",
238
+ "execution_count": null,
239
+ "metadata": {},
240
+ "outputs": [
241
+ {
242
+ "data": {
243
+ "application/vnd.jupyter.widget-view+json": {
244
+ "model_id": "c88d8353fb0d4dc6a2df946ea2082e5f",
245
+ "version_major": 2,
246
+ "version_minor": 0
247
+ },
248
+ "text/plain": [
249
+ "FileChooser(path='/workspaces/lo-achievement/nbs', filename='', title='Use the following file chooser to add e…"
250
+ ]
251
+ },
252
+ "metadata": {},
253
+ "output_type": "display_data"
254
+ },
255
+ {
256
+ "data": {
257
+ "application/vnd.jupyter.widget-view+json": {
258
+ "model_id": "9b8ed072582b4e87ace35d2d59c3a82f",
259
+ "version_major": 2,
260
+ "version_minor": 0
261
+ },
262
+ "text/plain": [
263
+ "Output()"
264
+ ]
265
+ },
266
+ "metadata": {},
267
+ "output_type": "display_data"
268
+ }
269
+ ],
270
+ "source": [
271
+ "res = setup_drives('local drive')"
272
+ ]
273
+ },
274
+ {
275
+ "cell_type": "code",
276
+ "execution_count": null,
277
+ "metadata": {},
278
+ "outputs": [
279
+ {
280
+ "data": {
281
+ "text/plain": [
282
+ "['/workspaces/lo-achievement/nbs/_quarto.yml',\n",
283
+ " '/workspaces/lo-achievement/nbs/nbdev.yml']"
284
+ ]
285
+ },
286
+ "execution_count": null,
287
+ "metadata": {},
288
+ "output_type": "execute_result"
289
+ }
290
+ ],
291
+ "source": [
292
+ "res"
293
+ ]
294
+ },
295
+ {
296
+ "cell_type": "markdown",
297
+ "metadata": {},
298
+ "source": [
299
+ "Now, we'll verify the behavior of Google Drive. We'll wrap this in a try/except block so the code can run all the way through."
300
+ ]
301
+ },
302
+ {
303
+ "cell_type": "code",
304
+ "execution_count": null,
305
+ "metadata": {},
306
+ "outputs": [
307
+ {
308
+ "name": "stdout",
309
+ "output_type": "stream",
310
+ "text": [
311
+ "An exception of type ValueError occurred. Arguments:\n",
312
+ "It looks like you're not on Google Colab. Google Drive mounting is currently only implemented for Google Colab.\n"
313
+ ]
314
+ }
315
+ ],
316
+ "source": [
317
+ "try:\n",
318
+ " setup_drives('google drive')\n",
319
+ "except Exception as e:\n",
320
+ " print(f\"An exception of type {type(e).__name__} occurred. Arguments:\\n{e}\")"
321
+ ]
322
+ },
323
+ {
324
+ "cell_type": "markdown",
325
+ "metadata": {},
326
+ "source": [
327
+ "## Future expected implementation\n",
328
+ "\n",
329
+ "The following code is included as it works, just not in Visual Studio code. The current implementation of the File chooser is a bit inelegant, but this is due to the current limitations of the combination of the libraries and platforms. Once some errors with VS code can be updated, this code will be the preferable solution as it is more familiar to users."
330
+ ]
331
+ },
332
+ {
333
+ "cell_type": "code",
334
+ "execution_count": null,
335
+ "metadata": {},
336
+ "outputs": [],
337
+ "source": [
338
+ "import ipywidgets as widgets\n",
339
+ "from IPython.display import display\n",
340
+ "\n",
341
+ "class UniversalFileUpload:\n",
342
+ "\n",
343
+ " def __init__(self):\n",
344
+ " self.filelist = []\n",
345
+ " self.uploader = None\n",
346
+ " self.status_output = None\n",
347
+ " \n",
348
+ " def _process_upload(self, change):\n",
349
+ " self.status_output.clear_output()\n",
350
+ " with self.status_output:\n",
351
+ " print('What is happening?')\n",
352
+ " print(change)\n",
353
+ "\n",
354
+ " def process_uploads(self, change):\n",
355
+ " if change['new'] and change['new'] != None:\n",
356
+ " with self.status_output:\n",
357
+ " print(change)\n",
358
+ " \n",
359
+ " self.filelist = change['new']\n",
360
+ " \n",
361
+ " #get filenames and promt\n",
362
+ " fnames = [fileinfo['name'] for fileinfo in self.filelist['metadata']]\n",
363
+ " with self.status_output:\n",
364
+ " print('Uploaded files:', fnames)\n",
365
+ " \n",
366
+ " #clear it so it doesn't save state\n",
367
+ " self.uploader.close()\n",
368
+ " \n",
369
+ " def get_upload_value(self):\n",
370
+ " return self.filelist\n",
371
+ " \n",
372
+ " def choose_files(self):\n",
373
+ " self.uploader = widgets.FileUpload(accept='', multiple=True, description='cat')\n",
374
+ " self.status_output = widgets.Output()\n",
375
+ " self.file_output_box = widgets.VBox([self.uploader, self.status_output])\n",
376
+ " self.uploader.observe(self._process_upload)\n",
377
+ "\n",
378
+ " with self.status_output:\n",
379
+ " print('Waiting...')\n",
380
+ "\n",
381
+ " return self.file_output_box"
382
+ ]
383
+ },
384
+ {
385
+ "cell_type": "code",
386
+ "execution_count": null,
387
+ "metadata": {},
388
+ "outputs": [],
389
+ "source": [
390
+ "#test\n",
391
+ "ul = UniversalFileUpload()\n",
392
+ "ul.choose_files()"
393
+ ]
394
+ }
395
+ ],
396
+ "metadata": {
397
+ "kernelspec": {
398
+ "display_name": "python3",
399
+ "language": "python",
400
+ "name": "python3"
401
+ }
402
+ },
403
+ "nbformat": 4,
404
+ "nbformat_minor": 2
405
+ }
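A note on the callback pattern in `MultiFileChooser.update_display` above: ipywidgets' `Button.on_click` calls its handler with the button as the only positional argument, so any extra data must be pre-bound with `functools.partial`, and the bound keyword has to match the handler's parameter name (`this_file`, per the fix above). A minimal widget-free sketch of the same pattern:

```python
# Widget-free sketch of the remove-button callback pattern used by MultiFileChooser.
from functools import partial

selected_files = ["a.txt", "b.txt"]

def remove_file(button, this_file):
    # `button` is supplied by ipywidgets at click time; `this_file` was bound when the row was built.
    if this_file in selected_files:
        selected_files.remove(this_file)

handler = partial(remove_file, this_file="a.txt")  # keyword must match the parameter name
handler(button=None)  # simulate a click
print(selected_files)  # -> ['b.txt']
```

Binding `file=this_file` instead would raise `TypeError: remove_file() got an unexpected keyword argument 'file'` the first time a remove button is clicked.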
nbs/media_stores.ipynb ADDED
@@ -0,0 +1,920 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# media_stores.ipynb\n",
8
+ "> A notebook for storing all types of media as vector stores\n",
9
+ "\n",
10
+ "In this notebook, we'll implement the functionality required to interact with many types of media stores. This is - not just for text files and pdfs, but also for images, audio, and video.\n",
11
+ "\n",
12
+ "Below are some references for integration of different media types into vector stores.\n",
13
+ "\n",
14
+ "- YouTube: https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/youtube_audio\n",
15
+ "- Websites:\n",
16
+ " - https://js.langchain.com/docs/modules/indexes/document_loaders/examples/web_loaders/\n",
17
+ " - https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/web_base\n",
18
+ " - Extracting relevant information from website: https://www.oncrawl.com/technical-seo/extract-relevant-text-content-from-html-page/\n",
19
+ "\n",
20
+ ":::{.callout-caution}\n",
21
+ "These notebooks are development notebooks, meaning that they are meant to be run locally or somewhere that supports navigating a full repository (in other words, not Google Colab unless you clone the entire repository to drive and then mount the Drive-Repository.) However, it is expected if you're able to do all of those steps, you're likely also able to figure out the required pip installs for development there.\n",
22
+ ":::\n"
23
+ ]
24
+ },
25
+ {
26
+ "cell_type": "raw",
27
+ "metadata": {},
28
+ "source": [
29
+ "---\n",
30
+ "skip_exec: true\n",
31
+ "---"
32
+ ]
33
+ },
34
+ {
35
+ "cell_type": "code",
36
+ "execution_count": null,
37
+ "metadata": {},
38
+ "outputs": [],
39
+ "source": [
40
+ "#| default_exp MediaVectorStores"
41
+ ]
42
+ },
43
+ {
44
+ "cell_type": "code",
45
+ "execution_count": null,
46
+ "metadata": {},
47
+ "outputs": [],
48
+ "source": [
49
+ "#| export\n",
50
+ "# import libraries here\n",
51
+ "import os\n",
52
+ "import itertools\n",
53
+ "\n",
54
+ "from langchain.embeddings import OpenAIEmbeddings\n",
55
+ "\n",
56
+ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
57
+ "from langchain.document_loaders.unstructured import UnstructuredFileLoader\n",
58
+ "from langchain.document_loaders.generic import GenericLoader\n",
59
+ "from langchain.document_loaders.parsers import OpenAIWhisperParser\n",
60
+ "from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader\n",
61
+ "from langchain.document_loaders import WebBaseLoader, UnstructuredURLLoader\n",
62
+ "from langchain.docstore.document import Document\n",
63
+ "\n",
64
+ "from langchain.vectorstores import Chroma\n",
65
+ "from langchain.chains import RetrievalQAWithSourcesChain"
66
+ ]
67
+ },
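The `Chroma` and `RetrievalQAWithSourcesChain` imports above are for the vector-store work this notebook builds toward. As a quick orientation, here is a minimal sketch (not part of the notebook; assumes `OPENAI_API_KEY` is set in the environment) of how segments produced by the converters below can be dropped into a Chroma store and queried:

```python
# Minimal sketch: converter output (a list of Documents) -> Chroma store -> similarity search.
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

docs = [Document(page_content="The cat was super cute and adorable.", metadata={"source": "demo"})]
store = Chroma.from_documents(docs, embedding=OpenAIEmbeddings())
hits = store.similarity_search("Who was adorable?", k=1)
print(hits[0].page_content)
```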
68
+ {
69
+ "cell_type": "markdown",
70
+ "metadata": {},
71
+ "source": [
72
+ "Note that we will not export the following packages to our module because in this exploration we have decided to go with langchain implementations, or they are only used for testing."
73
+ ]
74
+ },
75
+ {
76
+ "cell_type": "code",
77
+ "execution_count": null,
78
+ "metadata": {},
79
+ "outputs": [],
80
+ "source": [
81
+ "#exploration\n",
82
+ "import trafilatura\n",
83
+ "import requests\n",
84
+ "import justext"
85
+ ]
86
+ },
87
+ {
88
+ "cell_type": "markdown",
89
+ "metadata": {},
90
+ "source": [
91
+ "## Media to Text Converters\n",
92
+ "In this section, we provide a set of converters that can either read text and convert it to other useful text, or read YouTube or Websites and convert them into text."
93
+ ]
94
+ },
95
+ {
96
+ "cell_type": "markdown",
97
+ "metadata": {},
98
+ "source": [
99
+ "### Standard Text Splitter\n",
100
+ "Here we define a standard text splitter. This can be used on any text."
101
+ ]
102
+ },
103
+ {
104
+ "cell_type": "code",
105
+ "execution_count": null,
106
+ "metadata": {},
107
+ "outputs": [],
108
+ "source": [
109
+ "#| export\n",
110
+ "def rawtext_to_doc_split(text, chunk_size=1500, chunk_overlap=150):\n",
111
+ " \n",
112
+ " # Quick type checking\n",
113
+ " if not isinstance(text, list):\n",
114
+ " text = [text]\n",
115
+ "\n",
116
+ " # Create splitter\n",
117
+ " text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,\n",
118
+ " chunk_overlap=chunk_overlap,\n",
119
+ " add_start_index = True)\n",
120
+ " \n",
121
+ " #Split into docs segments\n",
122
+ " if isinstance(text[0], Document):\n",
123
+ " doc_segments = text_splitter.split_documents(text)\n",
124
+ " else:\n",
125
+ " doc_segments = text_splitter.split_documents(text_splitter.create_documents(text))\n",
126
+ "\n",
127
+ " # Make into one big list\n",
128
+ " doc_segments = list(itertools.chain(*doc_segments)) if isinstance(doc_segments[0], list) else doc_segments\n",
129
+ "\n",
130
+ " return doc_segments"
131
+ ]
132
+ },
133
+ {
134
+ "cell_type": "code",
135
+ "execution_count": null,
136
+ "metadata": {},
137
+ "outputs": [
138
+ {
139
+ "data": {
140
+ "text/plain": [
141
+ "[Document(page_content='This is a', metadata={}),\n",
142
+ " Document(page_content='sentence.', metadata={}),\n",
143
+ " Document(page_content='This is', metadata={}),\n",
144
+ " Document(page_content='another', metadata={}),\n",
145
+ " Document(page_content='sentence.', metadata={}),\n",
146
+ " Document(page_content='This is a', metadata={}),\n",
147
+ " Document(page_content='a third', metadata={}),\n",
148
+ " Document(page_content='sentence.', metadata={})]"
149
+ ]
150
+ },
151
+ "execution_count": null,
152
+ "metadata": {},
153
+ "output_type": "execute_result"
154
+ }
155
+ ],
156
+ "source": [
157
+ "# test basic functionality\n",
158
+ "rawtext_to_doc_split([\"This is a sentence. This is another sentence.\", \"This is a third sentence.\"], chunk_size=10, chunk_overlap=5)"
159
+ ]
160
+ },
161
+ {
162
+ "cell_type": "markdown",
163
+ "metadata": {},
164
+ "source": [
165
+ "We'll write a quick function to do a unit test on the function we just wrote."
166
+ ]
167
+ },
168
+ {
169
+ "cell_type": "code",
170
+ "execution_count": null,
171
+ "metadata": {},
172
+ "outputs": [],
173
+ "source": [
174
+ "def test_split_texts():\n",
175
+ " \n",
176
+ " # basic behavior\n",
177
+ " text = \"This is a sample text that we will use to test the splitter function.\"\n",
178
+ " expected_output = [\"This is a sample text that we will use to test the splitter function.\"]\n",
179
+ " out_splits = [doc.page_content for doc in rawtext_to_doc_split(text)]\n",
180
+ " assert all([target==expected for target, expected in zip(expected_output, out_splits)]), ('The basic splitter functionality is incorrect, and does not correctly ' +\n",
181
+ " 'use chunk_size and chunk_overlap on chunks <1500.')\n",
182
+ " \n",
183
+ " # try a known result with variable chunk_length and chunk_overlap\n",
184
+ " text = (\"This is a sample text that we will use to test the splitter function. It should split the \" +\n",
185
+ " \"text into multiple chunks of size 1500 with an overlap of 150 characters. This is the second chunk.\")\n",
186
+ " expected_output = ['This is a sample text that we will use to test the',\n",
187
+ " 'test the splitter function. It should split the',\n",
188
+ " 'split the text into multiple chunks of size 1500',\n",
189
+ " 'size 1500 with an overlap of 150 characters. This',\n",
190
+ " 'This is the second chunk.']\n",
191
+ " out_splits = [doc.page_content for doc in rawtext_to_doc_split(text, 50, 10)]\n",
192
+ " assert all([target==expected for target, expected in zip(expected_output, out_splits)]), 'The splitter does not correctly use chunk_size and chunk_overlap.'\n",
193
+ "\n",
194
+ "# Run test\n",
195
+ "test_split_texts()"
196
+ ]
197
+ },
198
+ {
199
+ "cell_type": "markdown",
200
+ "metadata": {},
201
+ "source": [
202
+ "The following function is used for testing to make sure single files and lists can be accommodated, and that what are returned are lists of documents."
203
+ ]
204
+ },
205
+ {
206
+ "cell_type": "code",
207
+ "execution_count": null,
208
+ "metadata": {},
209
+ "outputs": [],
210
+ "source": [
211
+ "# a set of tests to make sure that this works on both lists single inputs\n",
212
+ "def test_converters_inputs(test_fcn, files_list=None):\n",
213
+ " if files_list is None:\n",
214
+ " single_file = 'The cat was super cute and adorable'\n",
215
+ " multiple_files = [single_file, 'The dog was also cute and her wet nose is always so cold!']\n",
216
+ " elif isinstance(files_list, str):\n",
217
+ " single_file = files_list\n",
218
+ " multiple_files = [single_file, single_file]\n",
219
+ " elif isinstance(files_list, list):\n",
220
+ " single_file = files_list[0]\n",
221
+ " multiple_files = files_list\n",
222
+ " else:\n",
223
+ " TypeError(\"You've passed in a files_list which is neither a string or a list or None\")\n",
224
+ "\n",
225
+ " # test for single file\n",
226
+ " res = test_fcn(single_file)\n",
227
+ " assert isinstance(res, list), 'FAILED ASSERT in {test_fcn}. A single file should return a list.'\n",
228
+ " assert not isinstance(res[0], list), 'FAILED ASSERT in {test_fcn}. A single file should return a 1-dimensional list.'\n",
229
+ "\n",
230
+ " # test for multiple files\n",
231
+ " res = test_fcn(multiple_files)\n",
232
+ " assert isinstance(res, list), 'FAILED ASSERT in {test_fcn}. A list of files should return a list.'\n",
233
+ " assert not isinstance(res[0], list), 'FAILED ASSERT in {test_fcn}. A list of files should return a 1-dimensional list with all documents combined.'\n",
234
+ "\n",
235
+ " # test that the return type of elements should be Document\n",
236
+ " assert all([isinstance(doc, Document) for doc in res]), 'FAILED ASSERT in {test_fcn}. The return type of elements should be Document.'"
237
+ ]
238
+ },
239
+ {
240
+ "cell_type": "code",
241
+ "execution_count": null,
242
+ "metadata": {},
243
+ "outputs": [],
244
+ "source": [
245
+ "# test behavior of standard text splitter\n",
246
+ "test_converters_inputs(rawtext_to_doc_split)"
247
+ ]
248
+ },
249
+ {
250
+ "cell_type": "markdown",
251
+ "metadata": {},
252
+ "source": [
253
+ "### File or Files\n",
254
+ "Functions which load a single file or files from a directory, including pdfs, text files, html, images, and more. See [Unstructured File Documentation](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) for more information."
255
+ ]
256
+ },
257
+ {
258
+ "cell_type": "code",
259
+ "execution_count": null,
260
+ "metadata": {},
261
+ "outputs": [],
262
+ "source": [
263
+ "#| export\n",
264
+ "## A single File\n",
265
+ "def _file_to_text(single_file, chunk_size = 1000, chunk_overlap=150):\n",
266
+ "\n",
267
+ " # Create loader and get segments\n",
268
+ " loader = UnstructuredFileLoader(single_file)\n",
269
+ " doc_segments = loader.load_and_split(RecursiveCharacterTextSplitter(chunk_size=chunk_size,\n",
270
+ " chunk_overlap=chunk_overlap,\n",
271
+ " add_start_index=True))\n",
272
+ " return doc_segments\n",
273
+ "\n",
274
+ "\n",
275
+ "## Multiple files\n",
276
+ "def files_to_text(files_list, chunk_size=1000, chunk_overlap=150):\n",
277
+ " \n",
278
+ " # Quick type checking\n",
279
+ " if not isinstance(files_list, list):\n",
280
+ " files_list = [files_list]\n",
281
+ "\n",
282
+ " # This is currently a fix because the UnstructuredFileLoader expects a list of files yet can't split them correctly yet\n",
283
+ " all_segments = [_file_to_text(single_file, chunk_size=chunk_size, chunk_overlap=chunk_overlap) for single_file in files_list]\n",
284
+ " all_segments = list(itertools.chain(*all_segments)) if isinstance(all_segments[0], list) else all_segments\n",
285
+ "\n",
286
+ " return all_segments"
287
+ ]
288
+ },
289
+ {
290
+ "cell_type": "code",
291
+ "execution_count": null,
292
+ "metadata": {},
293
+ "outputs": [
294
+ {
295
+ "data": {
296
+ "text/plain": [
297
+ "[Document(page_content='Two roads diverged in a yellow wood,\\rAnd sorry I could not travel both\\rAnd be one traveler, long I', metadata={'source': '../roadnottaken.txt', 'start_index': 0}),\n",
298
+ " Document(page_content='traveler, long I stood\\rAnd looked down one as far as I could\\rTo where it bent in the', metadata={'source': '../roadnottaken.txt', 'start_index': 82}),\n",
299
+ " Document(page_content='it bent in the undergrowth;\\r\\rThen took the other, as just as fair,\\rAnd having perhaps the better', metadata={'source': '../roadnottaken.txt', 'start_index': 152}),\n",
300
+ " Document(page_content='perhaps the better claim,\\rBecause it was grassy and wanted wear;\\rThough as for that the passing', metadata={'source': '../roadnottaken.txt', 'start_index': 230}),\n",
301
+ " Document(page_content='that the passing there\\rHad worn them really about the same,\\r\\rAnd both that morning equally lay\\rIn', metadata={'source': '../roadnottaken.txt', 'start_index': 309}),\n",
302
+ " Document(page_content='equally lay\\rIn leaves no step had trodden black. Oh, I kept the first for another day! Yet knowing', metadata={'source': '../roadnottaken.txt', 'start_index': 392}),\n",
303
+ " Document(page_content='day! Yet knowing how way leads on to way,\\rI doubted if I should ever come back. I shall be telling', metadata={'source': '../roadnottaken.txt', 'start_index': 474}),\n",
304
+ " Document(page_content='I shall be telling this with a sigh\\rSomewhere ages and ages hence:\\rTwo roads diverged in a wood,', metadata={'source': '../roadnottaken.txt', 'start_index': 554}),\n",
305
+ " Document(page_content='diverged in a wood, and IэI took the one less traveled by,\\rAnd that has made all the difference.', metadata={'source': '../roadnottaken.txt', 'start_index': 631}),\n",
306
+ " Document(page_content='Two roads diverged in a yellow wood,\\rAnd sorry I could not travel both\\rAnd be one traveler, long I', metadata={'source': '../roadnottaken.txt', 'start_index': 0}),\n",
307
+ " Document(page_content='traveler, long I stood\\rAnd looked down one as far as I could\\rTo where it bent in the', metadata={'source': '../roadnottaken.txt', 'start_index': 82})]"
308
+ ]
309
+ },
310
+ "execution_count": null,
311
+ "metadata": {},
312
+ "output_type": "execute_result"
313
+ }
314
+ ],
315
+ "source": [
316
+ "# ensure basic behavior\n",
317
+ "res = files_to_text(['../roadnottaken.txt', '../roadnottaken.txt'], chunk_size=100, chunk_overlap=20)\n",
318
+ "res[:11]"
319
+ ]
320
+ },
321
+ {
322
+ "cell_type": "code",
323
+ "execution_count": null,
324
+ "metadata": {},
325
+ "outputs": [],
326
+ "source": [
327
+ "test_converters_inputs(files_to_text, '../roadnottaken.txt')"
328
+ ]
329
+ },
330
+ {
331
+ "cell_type": "markdown",
332
+ "metadata": {},
333
+ "source": [
334
+ "### Youtube\n",
335
+ "This works by first transcribing the video to text."
336
+ ]
337
+ },
338
+ {
339
+ "cell_type": "code",
340
+ "execution_count": null,
341
+ "metadata": {},
342
+ "outputs": [],
343
+ "source": [
344
+ "#| export\n",
345
+ "def youtube_to_text(urls, save_dir = \"content\"):\n",
346
+ " # Transcribe the videos to text\n",
347
+ " # save_dir: directory to save audio files\n",
348
+ "\n",
349
+ " if not isinstance(urls, list):\n",
350
+ " urls = [urls]\n",
351
+ " \n",
352
+ " youtube_loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())\n",
353
+ " youtube_docs = youtube_loader.load()\n",
354
+ " \n",
355
+ " return youtube_docs"
356
+ ]
357
+ },
358
+ {
359
+ "cell_type": "markdown",
360
+ "metadata": {},
361
+ "source": [
362
+ "Now, let's demonstrate functionality using some existing YouTube videos"
363
+ ]
364
+ },
365
+ {
366
+ "cell_type": "code",
367
+ "execution_count": null,
368
+ "metadata": {},
369
+ "outputs": [],
370
+ "source": [
371
+ "# Two Karpathy lecture videos\n",
372
+ "urls = [\"https://youtu.be/kCc8FmEb1nY\", \"https://youtu.be/VMj-3S1tku0\"]\n",
373
+ "youtube_text = youtube_to_text(urls)\n",
374
+ "youtube_text"
375
+ ]
376
+ },
377
+ {
378
+ "cell_type": "markdown",
379
+ "metadata": {},
380
+ "source": [
381
+ "Other Youtube helper functions to help with getting full features of YouTube videos are included below. These two grab and save the text of the transcripts.\n",
382
+ "\n",
383
+ "<p style=\"color:red\"><strong>Note that in this stage of development, the following cannot be tested due to YouTube download errors.</strong></p>"
384
+ ]
385
+ },
386
+ {
387
+ "cell_type": "code",
388
+ "execution_count": null,
389
+ "metadata": {},
390
+ "outputs": [],
391
+ "source": [
392
+ "#| export\n",
393
+ "def save_text(text, text_name = None):\n",
394
+ " if not text_name:\n",
395
+ " text_name = text[:20]\n",
396
+ " text_path = os.path.join(\"/content\",text_name+\".txt\")\n",
397
+ " \n",
398
+ " with open(text_path, \"x\") as f:\n",
399
+ " f.write(text)\n",
400
+ " # Return the location at which the transcript is saved\n",
401
+ " return text_path"
402
+ ]
403
+ },
404
+ {
405
+ "cell_type": "code",
406
+ "execution_count": null,
407
+ "metadata": {},
408
+ "outputs": [],
409
+ "source": [
410
+ "#| export\n",
411
+ "def get_youtube_transcript(yt_url, save_transcript = False, temp_audio_dir = \"sample_data\"):\n",
412
+ " # Transcribe the videos to text and save to file in /content\n",
413
+ " # save_dir: directory to save audio files\n",
414
+ "\n",
415
+ " youtube_docs = youtube_to_text(yt_url, save_dir = temp_audio_dir)\n",
416
+ " \n",
417
+ " # Combine doc\n",
418
+ " combined_docs = [doc.page_content for doc in youtube_docs]\n",
419
+ " combined_text = \" \".join(combined_docs)\n",
420
+ " \n",
421
+ " # Save text to file\n",
422
+ " video_path = youtube_docs[0].metadata[\"source\"]\n",
423
+ " youtube_name = os.path.splitext(os.path.basename(video_path))[0]\n",
424
+ "\n",
425
+ " save_path = None\n",
426
+ " if save_transcript:\n",
427
+ " save_path = save_text(combined_text, youtube_name)\n",
428
+ " \n",
429
+ " return youtube_docs, save_path"
430
+ ]
431
+ },
432
+ {
433
+ "cell_type": "markdown",
434
+ "metadata": {},
435
+ "source": [
436
+ "### Websites\n",
437
+ "We have a few different approaches to reading website text. Some approaches are specifically provided through langchain and some are other packages that seem to be performant. We'll show the pros/cons of each approach below.\n",
438
+ "\n",
439
+ "#### Langchain: WebBaseLoader"
440
+ ]
441
+ },
442
+ {
443
+ "cell_type": "code",
444
+ "execution_count": null,
445
+ "metadata": {},
446
+ "outputs": [],
447
+ "source": [
448
+ "#| export\n",
449
+ "def website_to_text_web(url, chunk_size = 1500, chunk_overlap=100):\n",
450
+ " \n",
451
+ " # Url can be a single string or list\n",
452
+ " website_loader = WebBaseLoader(url)\n",
453
+ " website_raw = website_loader.load()\n",
454
+ "\n",
455
+ " website_data = rawtext_to_doc_split(website_raw, chunk_size = chunk_size, chunk_overlap=chunk_overlap)\n",
456
+ " \n",
457
+ " # Combine doc\n",
458
+ " return website_data"
459
+ ]
460
+ },
461
+ {
462
+ "cell_type": "markdown",
463
+ "metadata": {},
464
+ "source": [
465
+ "Now for a quick test to ensure functionality..."
466
+ ]
467
+ },
468
+ {
469
+ "cell_type": "code",
470
+ "execution_count": null,
471
+ "metadata": {},
472
+ "outputs": [],
473
+ "source": [
474
+ "demo_urls = [\"https://www.espn.com/\", \"https://www.vanderbilt.edu/undergrad-datascience/faq\"]"
475
+ ]
476
+ },
477
+ {
478
+ "cell_type": "code",
479
+ "execution_count": null,
480
+ "metadata": {},
481
+ "outputs": [
482
+ {
483
+ "data": {
484
+ "text/plain": [
485
+ "[Document(page_content=\"ESPN - Serving Sports Fans. Anytime. Anywhere.\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n Skip to main content\\n \\n\\n Skip to navigation\\n \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n<\\n\\n>\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nMenuESPN\\n\\n\\nSearch\\n\\n\\n\\nscores\\n\\n\\n\\nNFLMLBNBANHLSoccerGolf…Women's World CupNCAAFNCAAMNCAAWSports BettingBoxingCFLNCAACricketF1HorseMMANASCARNBA G LeagueOlympic SportsPLLRacingRN BBRN FBRugbyTennisWNBAWWEX GamesXFLMore ESPNFantasyListenWatchESPN+\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n \\n\\nSUBSCRIBE NOW\\n\\n\\n\\n\\n\\nPaul vs. Diaz (ESPN+ PPV)\\n\\n\\n\\n\\n\\n\\n\\nPGA TOUR LIVE\\n\\n\\n\\n\\n\\n\\n\\nLittle League Baseball: Regionals\\n\\n\\n\\n\\n\\n\\n\\nMLB: Select Games\\n\\n\\n\\n\\n\\n\\n\\nCrossFit Games\\n\\n\\n\\n\\n\\n\\n\\nSlamBall\\n\\n\\n\\n\\n\\n\\n\\nThe Ultimate Fighter: Season 31\\n\\n\\n\\n\\n\\n\\n\\nFantasy Football: Top Storylines, Rookies, Sleepers\\n\\n\\nQuick Links\\n\\n\\n\\n\\nWomen's World Cup\\n\\n\\n\\n\\n\\n\\n\\nNHL Free Agency\\n\\n\\n\\n\\n\\n\\n\\nNBA Free Agency Buzz\\n\\n\\n\\n\\n\\n\\n\\nNBA Trade Machine\\n\\n\\n\\n\\n\\n\\n\\nThe Basketball Tournament\\n\\n\\n\\n\\n\\n\\n\\nFantasy Football: Sign Up\\n\\n\\n\\n\\n\\n\\n\\nHow To Watch PGA TOUR\\n\\n\\n\\n\\n\\n\\nFavorites\\n\\n\\n\\n\\n\\n\\n Manage Favorites\\n \\n\\n\\n\\nCustomize ESPNSign UpLog InESPN Sites\\n\\n\\n\\n\\nESPN Deportes\\n\\n\\n\\n\\n\\n\\n\\nAndscape\\n\\n\\n\\n\\n\\n\\n\\nespnW\\n\\n\\n\\n\\n\\n\\n\\nESPNFC\\n\\n\\n\\n\\n\\n\\n\\nX Games\\n\\n\\n\\n\\n\\n\\n\\nSEC Network\\n\\n\\nESPN Apps\\n\\n\\n\\n\\nESPN\\n\\n\\n\\n\\n\\n\\n\\nESPN Fantasy\\n\\n\\nFollow ESPN\\n\\n\\n\\n\\nFacebook\\n\\n\\n\\n\\n\\n\\n\\nX/Twitter\\n\\n\\n\\n\\n\\n\\n\\nInstagram\\n\\n\\n\\n\\n\\n\\n\\nSnapchat\\n\\n\\n\\n\\n\\n\\n\\nTikTok\\n\\n\\n\\n\\n\\n\\n\\nYouTube\", metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}),\n",
486
+ " Document(page_content=\"How can your team win the national title? Connelly breaks down what needs to go right for all 17 contendersThe fewer things that have to go right to win a title, the better a team's chances of taking the crown. Here's what has to fall each contender's way.7hBill ConnellyDale Zanine/USA TODAY SportsPosition U 2023: Is USC on the verge of taking over QBU from Oklahoma?Which schools produce the most talent at each position?1dDavid HaleConnelly's conference previews: Intel on all 133 FBS teamsTOP HEADLINESFreeze 'uncomfortable' as Auburn opens campTexans' Metchie relied on faith amid cancer fightHornets have new owners after MJ sale finalizedMiami coach expects rough treatment of MessiDrexel basketball player found dead in apartmentGermany exits WWC after draw with South KoreaBrady takes minority stake in English soccer teamDeep dish: Cubs' output at plate best since 1897Re-drafting 2018 NFL class 5 years laterWHAT HAPPENED IN INDY?Inside the shocking feud between Jonathan Taylor and the ColtsHe was the NFL's leading rusher two seasons ago and wanted an extension with the Colts, but now he wants out. How things got so bad for Taylor and Indianapolis.8hStephen HolderZach Bolinger/Icon Sportswire'THE BEST IN THE WORLD RIGHT NOW'Why Stephen A. is convinced Tyreek Hill is the NFL's top WR2h2:57WYNDHAM CHAMPIONSHIPCONTINUES THROUGH SUNDAYShane Lowry fluffs shot, drains birdie chip immediately after3h0:35Countdown to FedEx Cup Playoffs, AIG Open and the Ryder CupDiana Taurasi, 10,000\", metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}),\n",
487
+ " Document(page_content=\"Lowry fluffs shot, drains birdie chip immediately after3h0:35Countdown to FedEx Cup Playoffs, AIG Open and the Ryder CupDiana Taurasi, 10,000 points and the shot that made WNBA scoring history1dMLB SCOREBOARDTHURSDAY'S GAMESSee AllTrivia: Can you guess the right player?HERE COMES HELPBring on the reinforcements! 10 returning players as good as a trade deadline blockbusterInjured stars expected to come off the IL soon -- or have already -- could rock MLB's playoff races.7hAlden GonzalezJay Biggerstaff-USA TODAY Sports'CLEARLY THE ACC IS STRUGGLING'Finebaum: FSU is better off leaving the ACC5h1:04Thamel's realignment buzz: Latest on Pac-12, Big 12 and ACCAN AGGRESSIVE STRATEGYHow the Big 12 landed Colorado and shook up college footballThe Big 12 learned lessons two years ago after getting burned by Texas and Oklahoma. It resulted in a more aggressive strategy that could dramatically change the sport.2dHeather DinichRaymond Carlin/Icon Sportswire Top HeadlinesFreeze 'uncomfortable' as Auburn opens campTexans' Metchie relied on faith amid cancer fightHornets have new owners after MJ sale finalizedMiami coach expects rough treatment of MessiDrexel basketball player found dead in apartmentGermany exits WWC after draw with South KoreaBrady takes minority stake in English soccer teamDeep dish: Cubs' output at plate best since 1897Re-drafting 2018 NFL class 5 years laterFavorites FantasyManage FavoritesFantasy HomeCustomize ESPNSign UpLog InICYMI0:54Serena Williams, Alexis Ohanian\", metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}),\n",
488
+ " Document(page_content='2018 NFL class 5 years laterFavorites FantasyManage FavoritesFantasy HomeCustomize ESPNSign UpLog InICYMI0:54Serena Williams, Alexis Ohanian use drones to reveal gender of 2nd childSerena Williams and her husband Alexis Ohanian find out the gender of their second child in a spectacular display of drones. Best of ESPN+Todd Kirkland/Getty ImagesMLB 2023 trade deadline: Winners, losers and in-betweenersThe 2023 trade deadline is over! Who crushed it, and who left much to be desired? We weigh in on all 30 clubs.AP Photo/Matt YorkLowe: Why Bradley Beal could unlock KD, Book and the most dangerous version of the Phoenix Suns yetWith Kevin Durant, Devin Booker and Beal, Phoenix is already an inner-circle title contender. But if the Suns continue a Beal experiment the Wizards ran last season? Good luck.Cliff Welch/Icon SportswirePredicting 10 NFL starting quarterback battles: Who is QB1?We talked to people around the NFL and projected the QB1 for 10 unsettled situations, including a wide-open race in Tampa Bay. Trending NowAP Photo/Julio Cortez\\'Revis Island\\' resonates long after Hall of Famer\\'s retirementDarrelle Revis made his name as a dominant corner but might be best known for his \"island\" moniker players still adopt today.Illustration by ESPNThe wild life of Gardner MinshewFour colleges, three NFL teams, two Manias and the hug that broke the internet. It\\'s been an unbelievable ride for Gardner Minshew. Next stop: Indianapolis.Illustration by ESPNBest 2023 Women\\'s World Cup', metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}),\n",
489
+ " Document(page_content=\"that broke the internet. It's been an unbelievable ride for Gardner Minshew. Next stop: Indianapolis.Illustration by ESPNBest 2023 Women's World Cup players: Morgan, Caicedo, moreESPN's expert panel selected the top 25 players of the Women's World Cup to keep an eye on, from Sophia Smith to Sam Kerr and more. How to Watch on ESPN+(AP Photo/Koji Sasahara, File)How to watch the PGA Tour, Masters, PGA Championship and FedEx Cup playoffs on ESPN, ESPN+Here's everything you need to know about how to watch the PGA Tour, Masters, PGA Championship and FedEx Cup playoffs on ESPN and ESPN+. Sign up to play the #1 Fantasy game!Create A LeagueJoin Public LeagueReactivateMock Draft NowSign up for FREE!Create A LeagueJoin a Public LeagueReactivate a LeaguePractice With a Mock DraftSign up for FREE!Create A LeagueJoin a Public LeagueReactivate a LeaguePractice with a Mock Draft\", metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}),\n",
490
+ " Document(page_content=\"ESPN+\\n\\n\\n\\n\\nPaul vs. Diaz (ESPN+ PPV)\\n\\n\\n\\n\\n\\n\\n\\nPGA TOUR LIVE\\n\\n\\n\\n\\n\\n\\n\\nLittle League Baseball: Regionals\\n\\n\\n\\n\\n\\n\\n\\nMLB: Select Games\\n\\n\\n\\n\\n\\n\\n\\nCrossFit Games\\n\\n\\n\\n\\n\\n\\n\\nSlamBall\\n\\n\\n\\n\\n\\n\\n\\nThe Ultimate Fighter: Season 31\\n\\n\\n\\n\\n\\n\\n\\nFantasy Football: Top Storylines, Rookies, Sleepers\\n\\n\\nQuick Links\\n\\n\\n\\n\\nWomen's World Cup\\n\\n\\n\\n\\n\\n\\n\\nNHL Free Agency\\n\\n\\n\\n\\n\\n\\n\\nNBA Free Agency Buzz\\n\\n\\n\\n\\n\\n\\n\\nNBA Trade Machine\\n\\n\\n\\n\\n\\n\\n\\nThe Basketball Tournament\\n\\n\\n\\n\\n\\n\\n\\nFantasy Football: Sign Up\\n\\n\\n\\n\\n\\n\\n\\nHow To Watch PGA TOUR\\n\\n\\nESPN Sites\\n\\n\\n\\n\\nESPN Deportes\\n\\n\\n\\n\\n\\n\\n\\nAndscape\\n\\n\\n\\n\\n\\n\\n\\nespnW\\n\\n\\n\\n\\n\\n\\n\\nESPNFC\\n\\n\\n\\n\\n\\n\\n\\nX Games\\n\\n\\n\\n\\n\\n\\n\\nSEC Network\\n\\n\\nESPN Apps\\n\\n\\n\\n\\nESPN\\n\\n\\n\\n\\n\\n\\n\\nESPN Fantasy\\n\\n\\nFollow ESPN\\n\\n\\n\\n\\nFacebook\\n\\n\\n\\n\\n\\n\\n\\nX/Twitter\\n\\n\\n\\n\\n\\n\\n\\nInstagram\\n\\n\\n\\n\\n\\n\\n\\nSnapchat\\n\\n\\n\\n\\n\\n\\n\\nTikTok\\n\\n\\n\\n\\n\\n\\n\\nYouTube\\n\\n\\nTerms of UsePrivacy PolicyYour US State Privacy RightsChildren's Online Privacy PolicyInterest-Based AdsAbout Nielsen MeasurementDo Not Sell or Share My Personal InformationContact UsDisney Ad Sales SiteWork for ESPNCopyright: © ESPN Enterprises, Inc. All rights reserved.\", metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}),\n",
491
+ " Document(page_content='Frequently Asked Questions | Undergraduate Data Science | Vanderbilt University\\n\\n\\n\\n\\n \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nSkip to main content\\n\\nlink\\n\\n\\n\\n\\n\\nHome\\nPeople\\nMinor\\n\\nMinor Requirements\\nCourse Descriptions\\nCourse Schedule\\nHow to Declare the Minor\\nChoosing a Minor\\n\\n\\nResearch and Immersion\\n\\nResearch and Immersion Overview\\nDS 3850 Research in Data Science\\nDSI Summer Research Program\\nData Science for Social Good\\nResearch Immersion in Data Science\\nDSI Internship\\n\\n\\nFAQ\\nNews\\nForms\\nContact and Email List\\nData Science Institute\\n \\n\\n\\n\\n\\n\\n\\n\\t\\t\\t\\t\\t\\tUndergraduate Data Science \\n\\n\\n\\n\\n\\n\\n\\n\\n\\nFrequently Asked Questions\\nDeclaring the Minor\\n\\n\\n\\nHow do I declare the Data Science Minor?Use the forms and follow the procedures for your home college. See How to Declare the Data Science Minor.\\n\\n\\nWhen should I declare the Data Science Minor?While minor declarations can be made any time, DS courses will give some preference to students who have officially declared the Data Science Minor. So we recommend declaring the minor sooner rather than later. It is always possible to drop a declared minor. Minor declarations must be submitted at least two weeks before registration begins. Otherwise, the minor declaration will not be processed until after registration. No preference will be given during registration for an “intent” to declare because the minor declaration was made too late.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions | Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
492
+ " Document(page_content='I declared the Data Science Minor, but I did not get into the class I wanted to take for the minor. Why?First, preference for students who have declared the minor only applies to DS courses, not other courses. Second, if you declared the minor within two weeks of registration, your minor declaration will. not show up on YES, and you will not have preference. Third, while we try to hold as many seats for students who have declared the minor as we can, not all seats are reserved.\\n\\n\\nI am a first-year A&S student. Can I really declare the Data Science Minor now?Yes. While A&S students are usually prevented from declaring a major or minor until sophomore year, first-year A&S students can declare the Data Science Minor. As noted in the previous question, this can be important to do since some popular core DS courses will give some preference to students who have officially declared Data Science as a minor.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions | Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
493
+ " Document(page_content='I am a current junior (rising senior), can I complete the Data Science Minor (for Spring 2021 juniors only)?Juniors must contact the Director of Undergraduate Data Science to discuss options. DS 1000 is not open to current juniors (rising seniors). DS 3100 will not be taught next year (Fall 2021 or Spring 2022) and will need to be suitably replaced, which will require an approved plan from the Director. Furthermore, while DS / CS 3262 is current slated to be taught Spring 2022, that is not fully guaranteed, so students should see if they can take one of the other machine learning options.\\n\\n\\nI am a rising senior or current senior and cannot register for DS 1000. Why?Rising seniors and current seniors can only register for DS 1000 if there are available seats immediately before the semester begins with permission of the instructor. DS 1000 is intended as an introduction to data science for first years and sophomores, which is why this restriction is in place.\\n\\n\\nCollege-Specific Information\\n\\n\\n\\nWhat college is the home of the Data Science Minor?The Data Science Minor is a trans-institutional minor, shared by A&S, Blair, Engineering, and Peabody.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions | Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
494
+ " Document(page_content='I am an A&S student. Do DS courses count as A&S courses?All courses with a DS prefix count as courses within each of the colleges, including A&S. If you are an A&S student, and are taking a course that is cross-listed, make sure you enroll in the one with the DS prefix. Electives outside of A&S without the DS prefix will generally not count as A&S courses, so plan accordingly.\\n\\n\\nWhat are the unique credit hour rules for the Data Science Minor?Students electing an undergraduate minor in Data Science must follow academic regulations regarding minors in their home college, including but not limited to regulations regarding unique hours. The unique credit hour rule is specific to the College of Arts and Science and Peabody College. The School of Engineering and Blair School of Music do not have a unique credit hour rule. The Data Science minor cannot waive this rule. Please talk with your academic advisor about how to satisfy these requirements.\\n\\n\\nInfo About the Courses\\n\\n\\n\\nDS 1000Thank you for your interest in DS 1000! The course is full for the fall 2021 semester. Due to student demand and the transinstitutional nature of the course, we cannot make special exceptions as to which students, if any, on the waitlist are able to enroll. DS 1000 will be offered again in the spring semester.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions | Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
495
+ " Document(page_content='What computer programming course should I take?See What Programming Course To Take? In general, students interested in data science and scientific computing (not in computer science per se) should learn Python (and R).\\n\\n\\nHow do I find courses approved for the data science minor on YES?On YES, to select all courses approved for credit in the Data Science minor offered in a given semester, select the “Advanced” link next to the search box, select the “Class Attributes” drop-down box on the bottom right of the advanced search page, and then select “Eligible for Data Science” to find all courses. (Note that these course tags will not all be in place on YES until the registration period for Fall 2021 begins.)\\n\\n\\nCan other courses, besides those listed, count towards the Data Science Minor?New courses, special topics courses, or graduate-level courses that seem related to data science could count as electives. Contact the Director of Undergraduate Data Science to request consideration.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions | Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
496
+ " Document(page_content='Why doesn’t CS 1104 count towards the Data Science Minor?It does, as a prerequisite to CS 2204, which counts towards the minor. CS / DS 1100 was created as a new single-semester programming course for the Data Science Minor. It roughly has 2/3 the content of CS 1104 and 1/3 the content of CS 2204. While CS / DS 1100 counts as a single semester of programming for the minor, we strongly encourage students interested in data science, and in using data science tools and techniques, to take two semesters of programming in Python (CS / DS 1100 or CS 1104, followed by CS 2204). If you have taken CS 1104, you can take CS 1100, but you will only receive a total of four credits for\\xa0the two courses. See also What Programming Course To Take?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions | Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
497
+ " Document(page_content='I see that after having taken CS 1104, I can take CS/DS 1100 instead of taking CS 2204. What are the downsides of doing so?After taking CS 1104, we do recommend you take CS 2204. If you are interested in data science, a broader experience in Python in desirable (in fact, we recommend that students having taken CS 1100 try to take CS 2204 as well). CS/DS 1100 and 1104 have significant overlap (both are introductions to programming using Python). That said, it is permissible to take CS/DS 1100 after having taken CS 1104. You will only get 1 (out of 3) credit hours for CS/DS 1100 (after having taken CS 1104), but the combination of CS/DS 1100 and 1104 will satisfy the DS minor programming requirement. Note that if you enroll in three 3-hour courses and CS/DS 1100 (after having taken CS 1104) it will look like you are registered for 12 credit hours during registration and at the start of the semester, but your credit hours will be reduced to only 10 credit hours (because the credits for CS/DS 1100 will be cut back to 1 after the add/drop period). Enrolling in fewer than 12 credit hours can have significant consequences on financial aid and potentially on visa status for international students. Please be mindful of this.\\n\\n\\nWhat is the difference between CS 1100 and DS 1100?Nothing. They are the same course. They meet the same time in the same place and are taught by the same instructor. They are just cross-listed.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions | Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
498
+ " Document(page_content='I have taken CS 1101. What computer programming course should I take next?You have two options. You can either take CS 2201 (in C++) or take CS 1100 (in Python). Of course, you could also take CS 1104 and 2204 (in Python). CS 1100, 2201, and 2204 all satisfy the programming requirement for the minor. Note that CS 2201 is a prerequisite for many upper-level CS courses (as well as required for the CS major and minor). For more information, see What Programming Course To Take?\\n\\n\\nECON 3750 and MATH 3670 are listed both as satisfying the core machine learning requirement and as electives. If I take one, will it double-count for both requirements?No. They are listed under both because a student who takes one of the other machine learning\\xa0courses to satisfy the core requirement (CS/DS 3262 or CS 4262) can also take ECON 3750 or MATH 3670 as an elective; the content is sufficiently different that both can count towards the minor, but one course cannot double-count for two minor requirements.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions | Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
499
+ " Document(page_content='Can I take ECON 3750 or MATH 3670 as an elective if I have already taken CS 3262 or CS 4262?Yes (see above). ECON 3750 and MATH 3670 are sufficiently different from CS 3262 or CS 4262 (and from each other) that you can take these as electives. In fact, you could take ECON 3750 to satisfy the machine learning requirement and then take MATH 3670 as an elective.\\nCS 3262 can count towards the Data Science minor. CS 3262 does not count directly towards the Computer Science major requirements but could be used as either a tech elective or open elective for Computer Science majors.\\n\\n\\nWhy doesn’t MATH 2820 count towards the Data Science Minor?It does, as a prerequisite to MATH 2821, which counts towards the minor. The two-course sequence of MATH 2820 and MATH 2821 counts towards the Data Science Minor; the\\xa0two-course sequence is required because MATH 2820 goes deep into mathematical foundations of probability ad statistics concepts, but does not by itself cover the breadth of topics of other introductory statistics courses. This two-course sequence provides an excellent introduction to mathematical statistics.\\n\\n\\nResearch and Immersion Information\\n\\n\\n\\nCan I do research for course credit?Yes, you can do research for course credit (including DS 3850). More information can be found here: https://www.vanderbilt.edu/undergrad-datascience/ds-3850-research-in-data-science/', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions | Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
500
+ " Document(page_content='I am interested in the Undergraduate Data Science Immersion Program. How can I participate.Some competitive summer immersion programs include DSI-SPR and Data Science for Social Good (DSSG). More information can be found on the following websites.\\n\\nhttps://www.vanderbilt.edu/datascience/academics/undergraduate/summer-research-program/\\nhttps://www.vanderbilt.edu/datascience/data-science-for-social-good/\\n\\nTo get involved in data-science-oriented research with a faculty member, you will need to reach out to the faculty member. Pointers can be found here: https://www.vanderbilt.edu/undergrad-datascience/research-and-immersion-overview/. Having that research count towards the immersion requirement will be between your faculty mentor and your faculty immersion coordinator.\\nAdditional information about research opportunities will be posted on the website in the future.\\n\\xa0\\n\\n\\nContact\\n\\n\\n\\nHow do I ask a question about the Data Science?If you have questions about the Data Science Minor or Immersion opportunities in data science, please email us: undergrad.datascience@vanderbilt.edu\\n\\n\\nTo whom can I petition if the Director denies my request?The Governing Board of the Data Science Minor acts as the college-level oversight body for this trans-institutional minor and would be the appropriate next step for petitions related to the minor.\\n\\n\\n\\n\\n\\n\\n\\nData Science News\\n\\n\\n\\n Opportunities for Capstone Projects and Research Experience\\n\\n\\n\\n Attention Graduate Students! We’re Hiring!', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions | Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'}),\n",
501
+ " Document(page_content='Data Science News\\n\\n\\n\\n Opportunities for Capstone Projects and Research Experience\\n\\n\\n\\n Attention Graduate Students! We’re Hiring!\\n\\n\\n\\n Vanderbilt student-athlete drives sports performance through data analysis\\n\\n\\n\\n New Course: DS 3891 Special Topics: Intro to Generative AI\\n\\n\\n\\n Now Accepting Applications: DS Minor Teaching Fellowship for graduate students\\n\\n\\n\\n Join Our Team: Student Worker Positions Available for Fall 2023 Semester!\\n\\n\\n\\n\\n\\nVIEW MORE EVENTS >\\n\\n\\n\\n\\nYour Vanderbilt\\n\\nAlumni\\nCurrent Students\\nFaculty & Staff\\nInternational Students\\nMedia\\nParents & Family\\nProspective Students\\nResearchers\\nSports Fans\\nVisitors & Neighbors\\n\\n\\n\\n\\n \\n\\n\\n\\nQuick Links\\n\\nPeopleFinder\\nLibraries\\nNews\\nCalendar\\nMaps\\nA-Z\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n©\\n Vanderbilt University · All rights reserved. Site Development: Digital Strategies (Division of Communications)\\nVanderbilt University is committed to principles of equal opportunity and affirmative action. Accessibility information. Vanderbilt®, Vanderbilt University®, V Oak Leaf Design®, Star V Design® and Anchor Down® are trademarks of The Vanderbilt University', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq', 'title': 'Frequently Asked Questions | Undergraduate Data Science | Vanderbilt University', 'description': 'Frequently Asked Questions. ', 'language': 'en'})]"
502
+ ]
503
+ },
504
+ "execution_count": null,
505
+ "metadata": {},
506
+ "output_type": "execute_result"
507
+ }
508
+ ],
509
+ "source": [
510
+ "# get the results\n",
511
+ "res_web = website_to_text_web(demo_urls)\n",
512
+ "\n",
513
+ "res_web"
514
+ ]
515
+ },
516
+ {
517
+ "cell_type": "code",
518
+ "execution_count": null,
519
+ "metadata": {},
520
+ "outputs": [],
521
+ "source": [
522
+ "#unit testbed\n",
523
+ "test_converters_inputs(website_to_text_web, demo_urls)"
524
+ ]
525
+ },
526
+ {
527
+ "cell_type": "markdown",
528
+ "metadata": {},
529
+ "source": [
530
+ "Something interesting that we notice here is the proliferation of new lines that aren't for the best.\n",
531
+ "\n",
532
+ "#### Langchain: UnstructuredURLLoader"
533
+ ]
534
+ },
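+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import re\n",
+ "\n",
+ "# Sketch only (not exported): collapse runs of two or more newlines in each\n",
+ "# Document's page_content down to a single newline\n",
+ "cleaned = [re.sub(r'\\n{2,}', '\\n', doc.page_content) for doc in res_web]\n",
+ "cleaned[0][:300]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Langchain: UnstructuredURLLoader"
+ ]
+ },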
535
+ {
536
+ "cell_type": "code",
537
+ "execution_count": null,
538
+ "metadata": {},
539
+ "outputs": [],
540
+ "source": [
541
+ "#| export\n",
542
+ "def website_to_text_unstructured(web_urls, chunk_size = 1500, chunk_overlap=100):\n",
543
+ "\n",
544
+ " # Make sure it's a list\n",
545
+ " if not isinstance(web_urls, list):\n",
546
+ " web_urls = [web_urls]\n",
547
+ " \n",
548
+ " # Url can be a single string or list\n",
549
+ " website_loader = UnstructuredURLLoader(web_urls)\n",
550
+ " website_raw = website_loader.load()\n",
551
+ "\n",
552
+ " website_data = rawtext_to_doc_split(website_raw, chunk_size = chunk_size, chunk_overlap=chunk_overlap)\n",
553
+ " \n",
554
+ " # Return individual docs or list\n",
555
+ " return website_data"
556
+ ]
557
+ },
558
+ {
559
+ "cell_type": "code",
560
+ "execution_count": null,
561
+ "metadata": {},
562
+ "outputs": [
563
+ {
564
+ "data": {
565
+ "text/plain": [
566
+ "[Document(page_content=\"Menu\\n\\nESPN\\n\\nSearch\\n\\n\\n\\nscores\\n\\nNFL\\n\\nMLB\\n\\nNBA\\n\\nNHL\\n\\nSoccer\\n\\nGolf\\n\\n…Women's World CupNCAAFNCAAMNCAAWSports BettingBoxingCFLNCAACricketF1HorseMMANASCARNBA G LeagueOlympic SportsPLLRacingRN BBRN FBRugbyTennisWNBAWWEX GamesXFL\\n\\nMore ESPN\\n\\nFantasy\\n\\nListen\\n\\nWatch\\n\\nESPN+\\n\\nSUBSCRIBE NOW\\n\\nPaul vs. Diaz (ESPN+ PPV)\\n\\nPGA TOUR LIVE\\n\\nLittle League Baseball: Regionals\\n\\nMLB: Select Games\\n\\nCrossFit Games\\n\\nSlamBall\\n\\nThe Ultimate Fighter: Season 31\\n\\nFantasy Football: Top Storylines, Rookies, Sleepers\\n\\nQuick Links\\n\\nWomen's World Cup\\n\\nNHL Free Agency\\n\\nNBA Free Agency Buzz\\n\\nNBA Trade Machine\\n\\nThe Basketball Tournament\\n\\nFantasy Football: Sign Up\\n\\nHow To Watch PGA TOUR\\n\\nFavorites\\n\\nManage Favorites\\n\\nCustomize ESPN\\n\\nESPN Sites\\n\\nESPN Deportes\\n\\nAndscape\\n\\nespnW\\n\\nESPNFC\\n\\nX Games\\n\\nSEC Network\\n\\nESPN Apps\\n\\nESPN\\n\\nESPN Fantasy\\n\\nFollow ESPN\\n\\nFacebook\\n\\nX/Twitter\\n\\nInstagram\\n\\nSnapchat\\n\\nTikTok\\n\\nYouTube\\n\\nHow can your team win the national title? Connelly breaks down what needs to go right for all 17 contendersThe fewer things that have to go right to win a title, the better a team's chances of taking the crown. Here's what has to fall each contender's way.7hBill ConnellyDale Zanine/USA TODAY Sports\\n\\nPosition U 2023: Is USC on the verge of taking over QBU from Oklahoma?Which schools produce the most talent at each position?1dDavid Hale\\n\\nConnelly's conference previews: Intel on all 133 FBS teams\\n\\nTOP HEADLINES\\n\\nFreeze 'uncomfortable' as Auburn opens camp\\n\\nTexans' Metchie relied on faith amid cancer fight\", metadata={'source': 'https://www.espn.com/'}),\n",
567
+ " Document(page_content=\"TOP HEADLINES\\n\\nFreeze 'uncomfortable' as Auburn opens camp\\n\\nTexans' Metchie relied on faith amid cancer fight\\n\\nHornets have new owners after MJ sale finalized\\n\\nMiami coach expects rough treatment of Messi\\n\\nDrexel basketball player found dead in apartment\\n\\nGermany exits WWC after draw with South Korea\\n\\nBrady takes minority stake in English soccer team\\n\\nDeep dish: Cubs' output at plate best since 1897\\n\\nRe-drafting 2018 NFL class 5 years later\\n\\nWHAT HAPPENED IN INDY?\\n\\nInside the shocking feud between Jonathan Taylor and the ColtsHe was the NFL's leading rusher two seasons ago and wanted an extension with the Colts, but now he wants out. How things got so bad for Taylor and Indianapolis.8hStephen HolderZach Bolinger/Icon Sportswire\\n\\n'THE BEST IN THE WORLD RIGHT NOW'\\n\\nWhy Stephen A. is convinced Tyreek Hill is the NFL's top WR\\n\\n2h\\n\\n2:57\\n\\nWYNDHAM CHAMPIONSHIP\\n\\nCONTINUES THROUGH SUNDAY\\n\\nShane Lowry fluffs shot, drains birdie chip immediately after\\n\\n4h\\n\\n0:35\\n\\nCountdown to FedEx Cup Playoffs, AIG Open and the Ryder Cup\\n\\nDiana Taurasi, 10,000 points and the shot that made WNBA scoring history1d\\n\\nMLB SCOREBOARDTHURSDAY'S GAMES\\n\\nSee All\\n\\nTrivia: Can you guess the right player?\\n\\nHERE COMES HELP\\n\\nBring on the reinforcements! 10 returning players as good as a trade deadline blockbusterInjured stars expected to come off the IL soon -- or have already -- could rock MLB's playoff races.7hAlden GonzalezJay Biggerstaff-USA TODAY Sports\\n\\n'CLEARLY THE ACC IS STRUGGLING'\", metadata={'source': 'https://www.espn.com/'}),\n",
568
+ " Document(page_content=\"'CLEARLY THE ACC IS STRUGGLING'\\n\\nFinebaum: FSU is better off leaving the ACC\\n\\n5h\\n\\n1:04\\n\\nThamel's realignment buzz: Latest on Pac-12, Big 12 and ACC\\n\\nAN AGGRESSIVE STRATEGY\\n\\nHow the Big 12 landed Colorado and shook up college footballThe Big 12 learned lessons two years ago after getting burned by Texas and Oklahoma. It resulted in a more aggressive strategy that could dramatically change the sport.2dHeather DinichRaymond Carlin/Icon Sportswire\\n\\nTop Headlines\\n\\nFreeze 'uncomfortable' as Auburn opens camp\\n\\nTexans' Metchie relied on faith amid cancer fight\\n\\nHornets have new owners after MJ sale finalized\\n\\nMiami coach expects rough treatment of Messi\\n\\nDrexel basketball player found dead in apartment\\n\\nGermany exits WWC after draw with South Korea\\n\\nBrady takes minority stake in English soccer team\\n\\nDeep dish: Cubs' output at plate best since 1897\\n\\nRe-drafting 2018 NFL class 5 years later\\n\\nFavorites\\n\\nFantasy\\n\\nManage Favorites\\n\\nFantasy Home\\n\\nCustomize ESPN\\n\\nICYMI\\n\\n0:54\\n\\nSerena Williams, Alexis Ohanian use drones to reveal gender of 2nd childSerena Williams and her husband Alexis Ohanian find out the gender of their second child in a spectacular display of drones.\\n\\nBest of ESPN+\\n\\nTodd Kirkland/Getty Images\\n\\nMLB 2023 trade deadline: Winners, losers and in-betweenersThe 2023 trade deadline is over! Who crushed it, and who left much to be desired? We weigh in on all 30 clubs.\\n\\nAP Photo/Matt York\", metadata={'source': 'https://www.espn.com/'}),\n",
569
+ " Document(page_content='AP Photo/Matt York\\n\\nLowe: Why Bradley Beal could unlock KD, Book and the most dangerous version of the Phoenix Suns yetWith Kevin Durant, Devin Booker and Beal, Phoenix is already an inner-circle title contender. But if the Suns continue a Beal experiment the Wizards ran last season? Good luck.\\n\\nCliff Welch/Icon Sportswire\\n\\nPredicting 10 NFL starting quarterback battles: Who is QB1?We talked to people around the NFL and projected the QB1 for 10 unsettled situations, including a wide-open race in Tampa Bay.\\n\\nTrending Now\\n\\nAP Photo/Julio Cortez\\n\\n\\'Revis Island\\' resonates long after Hall of Famer\\'s retirementDarrelle Revis made his name as a dominant corner but might be best known for his \"island\" moniker players still adopt today.\\n\\nIllustration by ESPN\\n\\nThe wild life of Gardner MinshewFour colleges, three NFL teams, two Manias and the hug that broke the internet. It\\'s been an unbelievable ride for Gardner Minshew. Next stop: Indianapolis.\\n\\nIllustration by ESPN\\n\\nBest 2023 Women\\'s World Cup players: Morgan, Caicedo, moreESPN\\'s expert panel selected the top 25 players of the Women\\'s World Cup to keep an eye on, from Sophia Smith to Sam Kerr and more.\\n\\nHow to Watch on ESPN+\\n\\n(AP Photo/Koji Sasahara, File)\\n\\nHow to watch the PGA Tour, Masters, PGA Championship and FedEx Cup playoffs on ESPN, ESPN+Here\\'s everything you need to know about how to watch the PGA Tour, Masters, PGA Championship and FedEx Cup playoffs on ESPN and ESPN+.\\n\\nSign up to play the #1 Fantasy game!', metadata={'source': 'https://www.espn.com/'}),\n",
570
+ " Document(page_content=\"Sign up to play the #1 Fantasy game!\\n\\nCreate A League\\n\\nJoin Public League\\n\\nReactivate\\n\\nMock Draft Now\\n\\nSign up for FREE!\\n\\nCreate A League\\n\\nJoin a Public League\\n\\nReactivate a League\\n\\nPractice With a Mock Draft\\n\\nSign up for FREE!\\n\\nCreate A League\\n\\nJoin a Public League\\n\\nReactivate a League\\n\\nPractice with a Mock Draft\\n\\nESPN+\\n\\nWatch Now\\n\\nPaul vs. Diaz (ESPN+ PPV)\\n\\nPGA TOUR LIVE\\n\\nLittle League Baseball: Regionals\\n\\nMLB: Select Games\\n\\nCrossFit Games\\n\\nSlamBall\\n\\nThe Ultimate Fighter: Season 31\\n\\nFantasy Football: Top Storylines, Rookies, Sleepers\\n\\nQuick Links\\n\\nWomen's World Cup\\n\\nNHL Free Agency\\n\\nNBA Free Agency Buzz\\n\\nNBA Trade Machine\\n\\nThe Basketball Tournament\\n\\nFantasy Football: Sign Up\\n\\nHow To Watch PGA TOUR\\n\\nESPN Sites\\n\\nESPN Deportes\\n\\nAndscape\\n\\nespnW\\n\\nESPNFC\\n\\nX Games\\n\\nSEC Network\\n\\nESPN Apps\\n\\nESPN\\n\\nESPN Fantasy\\n\\nFollow ESPN\\n\\nFacebook\\n\\nX/Twitter\\n\\nInstagram\\n\\nSnapchat\\n\\nTikTok\\n\\nYouTube\\n\\nTerms of Use\\n\\nPrivacy Policy\\n\\nYour US State Privacy Rights\\n\\nChildren's Online Privacy Policy\\n\\nInterest-Based Ads\\n\\nAbout Nielsen Measurement\\n\\nDo Not Sell or Share My Personal Information\\n\\nContact Us\\n\\nDisney Ad Sales Site\\n\\nWork for ESPN\", metadata={'source': 'https://www.espn.com/'}),\n",
571
+ " Document(page_content='Skip to main content\\n\\nlink\\n\\nHome\\n\\nPeople\\n\\nMinor\\n\\n\\tMinor Requirements\\n\\tCourse Descriptions\\n\\tCourse Schedule\\n\\tHow to Declare the Minor\\n\\tChoosing a Minor\\n\\nResearch and Immersion\\n\\n\\tResearch and Immersion Overview\\n\\tDS 3850 Research in Data Science\\n\\tDSI Summer Research Program\\n\\tData Science for Social Good\\n\\tResearch Immersion in Data Science\\n\\tDSI Internship\\n\\nFAQ\\n\\nNews\\n\\nForms\\n\\nContact and Email List\\n\\nData Science Institute\\n\\nUndergraduate Data Science\\n\\nFrequently Asked Questions\\n\\nDeclaring the Minor\\n\\nHow do I declare the Data Science Minor?\\n\\nUse the forms and follow the procedures for your home college. See How to Declare the Data Science Minor.\\n\\nWhen should I declare the Data Science Minor?\\n\\nWhile minor declarations can be made any time, DS courses will give some preference to students who have officially declared the Data Science Minor. So we recommend declaring the minor sooner rather than later. It is always possible to drop a declared minor. Minor declarations must be submitted at least two weeks before registration begins. Otherwise, the minor declaration will not be processed until after registration. No preference will be given during registration for an “intent” to declare because the minor declaration was made too late.\\n\\nI declared the Data Science Minor, but I did not get into the class I wanted to take for the minor. Why?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
572
+ " Document(page_content='I declared the Data Science Minor, but I did not get into the class I wanted to take for the minor. Why?\\n\\nFirst, preference for students who have declared the minor only applies to DS courses, not other courses. Second, if you declared the minor within two weeks of registration, your minor declaration will. not show up on YES, and you will not have preference. Third, while we try to hold as many seats for students who have declared the minor as we can, not all seats are reserved.\\n\\nI am a first-year A&S student. Can I really declare the Data Science Minor now?\\n\\nYes. While A&S students are usually prevented from declaring a major or minor until sophomore year, first-year A&S students can declare the Data Science Minor. As noted in the previous question, this can be important to do since some popular core DS courses will give some preference to students who have officially declared Data Science as a minor.\\n\\nI am a current junior (rising senior), can I complete the Data Science Minor (for Spring 2021 juniors only)?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
573
+ " Document(page_content='I am a current junior (rising senior), can I complete the Data Science Minor (for Spring 2021 juniors only)?\\n\\nJuniors must contact the Director of Undergraduate Data Science to discuss options. DS 1000 is not open to current juniors (rising seniors). DS 3100 will not be taught next year (Fall 2021 or Spring 2022) and will need to be suitably replaced, which will require an approved plan from the Director. Furthermore, while DS / CS 3262 is current slated to be taught Spring 2022, that is not fully guaranteed, so students should see if they can take one of the other machine learning options.\\n\\nI am a rising senior or current senior and cannot register for DS 1000. Why?\\n\\nRising seniors and current seniors can only register for DS 1000 if there are available seats immediately before the semester begins with permission of the instructor. DS 1000 is intended as an introduction to data science for first years and sophomores, which is why this restriction is in place.\\n\\nCollege-Specific Information\\n\\nWhat college is the home of the Data Science Minor?\\n\\nThe Data Science Minor is a trans-institutional minor, shared by A&S, Blair, Engineering, and Peabody.\\n\\nI am an A&S student. Do DS courses count as A&S courses?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
574
+ " Document(page_content='I am an A&S student. Do DS courses count as A&S courses?\\n\\nAll courses with a DS prefix count as courses within each of the colleges, including A&S. If you are an A&S student, and are taking a course that is cross-listed, make sure you enroll in the one with the DS prefix. Electives outside of A&S without the DS prefix will generally not count as A&S courses, so plan accordingly.\\n\\nWhat are the unique credit hour rules for the Data Science Minor?\\n\\nStudents electing an undergraduate minor in Data Science must follow academic regulations regarding minors in their home college, including but not limited to regulations regarding unique hours. The unique credit hour rule is specific to the College of Arts and Science and Peabody College. The School of Engineering and Blair School of Music do not have a unique credit hour rule. The Data Science minor cannot waive this rule. Please talk with your academic advisor about how to satisfy these requirements.\\n\\nInfo About the Courses\\n\\nDS 1000\\n\\nThank you for your interest in DS 1000! The course is full for the fall 2021 semester. Due to student demand and the transinstitutional nature of the course, we cannot make special exceptions as to which students, if any, on the waitlist are able to enroll. DS 1000 will be offered again in the spring semester.\\n\\nWhat computer programming course should I take?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
575
+ " Document(page_content='What computer programming course should I take?\\n\\nSee What Programming Course To Take? In general, students interested in data science and scientific computing (not in computer science per se) should learn Python (and R).\\n\\nHow do I find courses approved for the data science minor on YES?\\n\\nOn YES, to select all courses approved for credit in the Data Science minor offered in a given semester, select the “Advanced” link next to the search box, select the “Class Attributes” drop-down box on the bottom right of the advanced search page, and then select “Eligible for Data Science” to find all courses. (Note that these course tags will not all be in place on YES until the registration period for Fall 2021 begins.)\\n\\nCan other courses, besides those listed, count towards the Data Science Minor?\\n\\nNew courses, special topics courses, or graduate-level courses that seem related to data science could count as electives. Contact the Director of Undergraduate Data Science to request consideration.\\n\\nWhy doesn’t CS 1104 count towards the Data Science Minor?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
576
+ " Document(page_content='Why doesn’t CS 1104 count towards the Data Science Minor?\\n\\nIt does, as a prerequisite to CS 2204, which counts towards the minor. CS / DS 1100 was created as a new single-semester programming course for the Data Science Minor. It roughly has 2/3 the content of CS 1104 and 1/3 the content of CS 2204. While CS / DS 1100 counts as a single semester of programming for the minor, we strongly encourage students interested in data science, and in using data science tools and techniques, to take two semesters of programming in Python (CS / DS 1100 or CS 1104, followed by CS 2204). If you have taken CS 1104, you can take CS 1100, but you will only receive a total of four credits for\\xa0the two courses. See also What Programming Course To Take?\\n\\nI see that after having taken CS 1104, I can take CS/DS 1100 instead of taking CS 2204. What are the downsides of doing so?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
577
+ " Document(page_content='I see that after having taken CS 1104, I can take CS/DS 1100 instead of taking CS 2204. What are the downsides of doing so?\\n\\nAfter taking CS 1104, we do recommend you take CS 2204. If you are interested in data science, a broader experience in Python in desirable (in fact, we recommend that students having taken CS 1100 try to take CS 2204 as well). CS/DS 1100 and 1104 have significant overlap (both are introductions to programming using Python). That said, it is permissible to take CS/DS 1100 after having taken CS 1104. You will only get 1 (out of 3) credit hours for CS/DS 1100 (after having taken CS 1104), but the combination of CS/DS 1100 and 1104 will satisfy the DS minor programming requirement. Note that if you enroll in three 3-hour courses and CS/DS 1100 (after having taken CS 1104) it will look like you are registered for 12 credit hours during registration and at the start of the semester, but your credit hours will be reduced to only 10 credit hours (because the credits for CS/DS 1100 will be cut back to 1 after the add/drop period). Enrolling in fewer than 12 credit hours can have significant consequences on financial aid and potentially on visa status for international students. Please be mindful of this.\\n\\nWhat is the difference between CS 1100 and DS 1100?\\n\\nNothing. They are the same course. They meet the same time in the same place and are taught by the same instructor. They are just cross-listed.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
578
+ " Document(page_content='Nothing. They are the same course. They meet the same time in the same place and are taught by the same instructor. They are just cross-listed.\\n\\nI have taken CS 1101. What computer programming course should I take next?\\n\\nYou have two options. You can either take CS 2201 (in C++) or take CS 1100 (in Python). Of course, you could also take CS 1104 and 2204 (in Python). CS 1100, 2201, and 2204 all satisfy the programming requirement for the minor. Note that CS 2201 is a prerequisite for many upper-level CS courses (as well as required for the CS major and minor). For more information, see What Programming Course To Take?\\n\\nECON 3750 and MATH 3670 are listed both as satisfying the core machine learning requirement and as electives. If I take one, will it double-count for both requirements?\\n\\nNo. They are listed under both because a student who takes one of the other machine learning\\xa0courses to satisfy the core requirement (CS/DS 3262 or CS 4262) can also take ECON 3750 or MATH 3670 as an elective; the content is sufficiently different that both can count towards the minor, but one course cannot double-count for two minor requirements.\\n\\nCan I take ECON 3750 or MATH 3670 as an elective if I have already taken CS 3262 or CS 4262?', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
579
+ " Document(page_content='Can I take ECON 3750 or MATH 3670 as an elective if I have already taken CS 3262 or CS 4262?\\n\\nYes (see above). ECON 3750 and MATH 3670 are sufficiently different from CS 3262 or CS 4262 (and from each other) that you can take these as electives. In fact, you could take ECON 3750 to satisfy the machine learning requirement and then take MATH 3670 as an elective.\\n\\nCS 3262 can count towards the Data Science minor. CS 3262 does not count directly towards the Computer Science major requirements but could be used as either a tech elective or open elective for Computer Science majors.\\n\\nWhy doesn’t MATH 2820 count towards the Data Science Minor?\\n\\nIt does, as a prerequisite to MATH 2821, which counts towards the minor. The two-course sequence of MATH 2820 and MATH 2821 counts towards the Data Science Minor; the\\xa0two-course sequence is required because MATH 2820 goes deep into mathematical foundations of probability ad statistics concepts, but does not by itself cover the breadth of topics of other introductory statistics courses. This two-course sequence provides an excellent introduction to mathematical statistics.\\n\\nResearch and Immersion Information\\n\\nCan I do research for course credit?\\n\\nYes, you can do research for course credit (including DS 3850). More information can be found here: https://www.vanderbilt.edu/undergrad-datascience/ds-3850-research-in-data-science/\\n\\nI am interested in the Undergraduate Data Science Immersion Program. How can I participate.', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
580
+ " Document(page_content='I am interested in the Undergraduate Data Science Immersion Program. How can I participate.\\n\\nSome competitive summer immersion programs include DSI-SPR and Data Science for Social Good (DSSG). More information can be found on the following websites.\\n\\nhttps://www.vanderbilt.edu/datascience/academics/undergraduate/summer-research-program/\\n\\nhttps://www.vanderbilt.edu/datascience/data-science-for-social-good/\\n\\nTo get involved in data-science-oriented research with a faculty member, you will need to reach out to the faculty member. Pointers can be found here: https://www.vanderbilt.edu/undergrad-datascience/research-and-immersion-overview/. Having that research count towards the immersion requirement will be between your faculty mentor and your faculty immersion coordinator.\\n\\nAdditional information about research opportunities will be posted on the website in the future.\\n\\nContact\\n\\nHow do I ask a question about the Data Science?\\n\\nIf you have questions about the Data Science Minor or Immersion opportunities in data science, please email us: undergrad.datascience@vanderbilt.edu\\n\\nTo whom can I petition if the Director denies my request?\\n\\nThe Governing Board of the Data Science Minor acts as the college-level oversight body for this trans-institutional minor and would be the appropriate next step for petitions related to the minor.\\n\\nData Science News\\n\\nOpportunities for Capstone Projects and Research Experience\\n\\nAttention Graduate Students! We’re Hiring!', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'}),\n",
581
+ " Document(page_content='Data Science News\\n\\nOpportunities for Capstone Projects and Research Experience\\n\\nAttention Graduate Students! We’re Hiring!\\n\\nVanderbilt student-athlete drives sports performance through data analysis\\n\\nNew Course: DS 3891 Special Topics: Intro to Generative AI\\n\\nNow Accepting Applications: DS Minor Teaching Fellowship for graduate students\\n\\nJoin Our Team: Student Worker Positions Available for Fall 2023 Semester!\\n\\nVIEW MORE EVENTS >\\n\\nYour Vanderbilt\\n\\nAlumni\\n\\nCurrent Students\\n\\nFaculty & Staff\\n\\nInternational Students\\n\\nMedia\\n\\nParents & Family\\n\\nProspective Students\\n\\nResearchers\\n\\nSports Fans\\n\\nVisitors & Neighbors\\n\\nQuick Links\\n\\nPeopleFinder\\n\\nLibraries\\n\\nNews\\n\\nCalendar\\n\\nMaps\\n\\nA-Z\\n\\n©\\n Site Development: Digital Strategies (Division of Communications)\\n Vanderbilt University is committed to principles of equal opportunity and affirmative action. Accessibility information. Vanderbilt®, Vanderbilt University®, V Oak Leaf Design®, Star V Design® and Anchor Down® are trademarks of The Vanderbilt University', metadata={'source': 'https://www.vanderbilt.edu/undergrad-datascience/faq'})]"
582
+ ]
583
+ },
584
+ "execution_count": null,
585
+ "metadata": {},
586
+ "output_type": "execute_result"
587
+ }
588
+ ],
589
+ "source": [
590
+ "# get the results\n",
591
+ "res_unstructured = website_to_text_unstructured(demo_urls)\n",
592
+ "res_unstructured"
593
+ ]
594
+ },
595
+ {
596
+ "cell_type": "code",
597
+ "execution_count": null,
598
+ "metadata": {},
599
+ "outputs": [],
600
+ "source": [
601
+ "#unit testb\n",
602
+ "test_converters_inputs(website_to_text_unstructured, demo_urls)"
603
+ ]
604
+ },
605
+ {
606
+ "cell_type": "markdown",
607
+ "metadata": {},
608
+ "source": [
609
+ "We also see here that there's something to be said about the unstructured approach which appears to be more conservative in the number of newline characters but still appears to preserve content. However, the gain is not overly significant.\n",
610
+ "\n",
611
+ "#### Trafilatura Parsing\n",
612
+ "\n",
613
+ "[Tralifatura](https://trafilatura.readthedocs.io/en/latest/) is a Python and command-line utility which attempts to extracts the most relevant information from a given website. "
614
+ ]
615
+ },
616
+ {
617
+ "cell_type": "code",
618
+ "execution_count": null,
619
+ "metadata": {},
620
+ "outputs": [],
621
+ "source": [
622
+ "def website_trafilatura(url):\n",
623
+ " downloaded = trafilatura.fetch_url(url)\n",
624
+ " return trafilatura.extract(downloaded)"
625
+ ]
626
+ },
627
+ {
628
+ "cell_type": "code",
629
+ "execution_count": null,
630
+ "metadata": {},
631
+ "outputs": [
632
+ {
633
+ "name": "stdout",
634
+ "output_type": "stream",
635
+ "text": [
636
+ "Total number of characters in example: 1565 \n",
637
+ "\n"
638
+ ]
639
+ },
640
+ {
641
+ "data": {
642
+ "text/plain": [
643
+ "'|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nPHI\\nMIA\\n||\\n56-49\\n57-49\\n||\\n||\\n||\\n||\\n||\\n6:40 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nMIL\\nWSH\\n||\\n57-49\\n44-62\\n||\\n||\\n||\\n||\\n||\\n7:05 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nTB\\nNYY\\n||\\n64-44\\n55-50\\n||\\n||\\n||\\n||\\n||\\n7:05 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nBAL\\nTOR\\n||\\n64-41\\n59-47\\n||\\n||\\n||\\n||\\n||\\n7:07 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nLAA\\nATL\\n||\\n55-51\\n67-36\\n||\\n||\\n||\\n||\\n||\\n7:20 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nCIN\\nCHC\\n||\\n58-49\\n53-52\\n||\\n||\\n||\\n||\\n||\\n8:05 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nCLE\\nHOU\\n||\\n53-53\\n59-47\\n||\\n||\\n||\\n||\\n||\\n8:10 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nSD\\nCOL\\n||\\n52-54\\n41-64\\n||\\n||\\n||\\n||\\n||\\n8:40 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nBOS\\nSEA\\n||\\n56-49\\n54-51\\n||\\n||\\n||\\n||\\n||\\n9:40 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nARI\\nSF\\n||\\n56-50\\n58-48\\n||\\n||\\n||\\n||\\n||\\n9:45 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nJPN\\nESP\\n||\\n4\\n0\\n||\\n||\\n||\\n||\\n||\\nFT\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nCRC\\nZAM\\n||\\n1\\n3\\n||\\n||\\n||\\n||\\n||\\nFT\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nCAN\\nAUS\\n||\\n0\\n4\\n||\\n||\\n||\\n||\\n||\\nFT\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nIRL\\nNGA\\n||\\n0\\n0\\n||\\n||\\n||\\n||\\n||\\nFT\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nPOR\\nUSA\\n||\\nWLDDL\\nDWWWW\\n||\\n||\\n||\\n||\\n||\\n3:00 AM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nVIE\\nNED\\n||\\nLLLLL\\nDWWWL\\n||\\n||\\n||\\n||\\n||\\n3:00 AM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nCHN\\nENG\\n||\\nWLDWW\\nWWDLW\\n||\\n||\\n||\\n||\\n||\\n7:00 AM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nHAI\\nDEN\\n||\\nLLLWL\\nLWLWW\\n||\\n||\\n||\\n||\\n||\\n7:00 AM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nAME\\nCLB\\n||\\nWWLLW\\nWLDDW\\n||\\n||\\n||\\n||\\n||\\n8:00 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nPUE\\nCHI\\n||\\nLLLDL\\nWWWWL\\n||\\n||\\n||\\n||\\n||\\n8:00 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nTOL\\nCOL\\n||\\nWLWDW\\nLDDWL\\n||\\n||\\n||\\n||\\n||\\n9:30 PM ET\\n||\\n|\\n|\\n|\\n|\\n|\\n|\\n||\\n|\\n|\\n||\\nGDL\\nSKC\\n||\\nLWWWL\\nLLDDW\\n||\\n||\\n||\\n||\\n||\\n10:00 PM ET\\n||\\n|\\n|\\n|'"
644
+ ]
645
+ },
646
+ "execution_count": null,
647
+ "metadata": {},
648
+ "output_type": "execute_result"
649
+ }
650
+ ],
651
+ "source": [
652
+ "trafilatura_text = website_trafilatura(demo_urls[0])\n",
653
+ "print('Total number of characters in example:', len(trafilatura_text), '\\n')\n",
654
+ "trafilatura_text"
655
+ ]
656
+ },
657
+ {
658
+ "cell_type": "markdown",
659
+ "metadata": {},
660
+ "source": [
661
+ "This output is SUBSTANTIALLY shorter with a length of 1565 characters. However, the problem is that the main article on the page actually isn't captured at all.\n",
662
+ "\n",
663
+ "#### jusText\n",
664
+ "\n",
665
+ "[jusText](https://pypi.org/project/jusText/) is another Python library for extracting content from a website."
666
+ ]
667
+ },
668
+ {
669
+ "cell_type": "code",
670
+ "execution_count": null,
671
+ "metadata": {},
672
+ "outputs": [],
673
+ "source": [
674
+ "def website_justext(url):\n",
675
+ " response = requests.get(url)\n",
676
+ " paragraphs = justext.justext(response.content, justext.get_stoplist(\"English\"))\n",
677
+ " content = [paragraph.text for paragraph in paragraphs \\\n",
678
+ " if not paragraph.is_boilerplate]\n",
679
+ " text = \" \".join(content)\n",
680
+ " return text"
681
+ ]
682
+ },
683
+ {
684
+ "cell_type": "code",
685
+ "execution_count": null,
686
+ "metadata": {},
687
+ "outputs": [
688
+ {
689
+ "data": {
690
+ "text/plain": [
691
+ "''"
692
+ ]
693
+ },
694
+ "execution_count": null,
695
+ "metadata": {},
696
+ "output_type": "execute_result"
697
+ }
698
+ ],
699
+ "source": [
700
+ "# Ensure behavior\n",
701
+ "justext_text = website_justext(demo_urls[0])\n",
702
+ "justext_text"
703
+ ]
704
+ },
705
+ {
706
+ "cell_type": "code",
707
+ "execution_count": null,
708
+ "metadata": {},
709
+ "outputs": [
710
+ {
711
+ "data": {
712
+ "text/plain": [
713
+ "'Declaring the Minor While minor declarations can be made any time, DS courses will give some preference to students who have officially declared the Data Science Minor. So we recommend declaring the minor sooner rather than later. It is always possible to drop a declared minor. Minor declarations must be submitted at least two weeks before registration begins. Otherwise, the minor declaration will not be processed until after registration. No preference will be given during registration for an “intent” to declare because the minor declaration was made too late. First, preference for students who have declared the minor only applies to DS courses, not other courses. Second, if you declared the minor within two weeks of registration, your minor declaration will. not show up on YES, and you will not have preference. Third, while we try to hold as many seats for students who have declared the minor as we can, not all seats are reserved. Yes. While A&S students are usually prevented from declaring a major or minor until sophomore year, first-year A&S students can declare the Data Science Minor. As noted in the previous question, this can be important to do since some popular core DS courses will give some preference to students who have officially declared Data Science as a minor. Juniors must contact the Director of Undergraduate Data Science to discuss options. DS 1000 is not open to current juniors (rising seniors). DS 3100 will not be taught next year (Fall 2021 or Spring 2022) and will need to be suitably replaced, which will require an approved plan from the Director. Furthermore, while DS / CS 3262 is current slated to be taught Spring 2022, that is not fully guaranteed, so students should see if they can take one of the other machine learning options. Rising seniors and current seniors can only register for DS 1000 if there are available seats immediately before the semester begins with permission of the instructor. DS 1000 is intended as an introduction to data science for first years and sophomores, which is why this restriction is in place. All courses with a DS prefix count as courses within each of the colleges, including A&S. If you are an A&S student, and are taking a course that is cross-listed, make sure you enroll in the one with the DS prefix. Electives outside of A&S without the DS prefix will generally not count as A&S courses, so plan accordingly. Students electing an undergraduate minor in Data Science must follow academic regulations regarding minors in their home college, including but not limited to regulations regarding unique hours. The unique credit hour rule is specific to the College of Arts and Science and Peabody College. The School of Engineering and Blair School of Music do not have a unique credit hour rule. The Data Science minor cannot waive this rule. Please talk with your academic advisor about how to satisfy these requirements. Info About the Courses Thank you for your interest in DS 1000! The course is full for the fall 2021 semester. Due to student demand and the transinstitutional nature of the course, we cannot make special exceptions as to which students, if any, on the waitlist are able to enroll. DS 1000 will be offered again in the spring semester. 
On YES, to select all courses approved for credit in the Data Science minor offered in a given semester, select the “Advanced” link next to the search box, select the “Class Attributes” drop-down box on the bottom right of the advanced search page, and then select “Eligible for Data Science” to find all courses. (Note that these course tags will not all be in place on YES until the registration period for Fall 2021 begins.) It does, as a prerequisite to CS 2204, which counts towards the minor. CS / DS 1100 was created as a new single-semester programming course for the Data Science Minor. It roughly has 2/3 the content of CS 1104 and 1/3 the content of CS 2204. While CS / DS 1100 counts as a single semester of programming for the minor, we strongly encourage students interested in data science, and in using data science tools and techniques, to take two semesters of programming in Python (CS / DS 1100 or CS 1104, followed by CS 2204). If you have taken CS 1104, you can take CS 1100, but you will only receive a total of four credits for the two courses. See also What Programming Course To Take? After taking CS 1104, we do recommend you take CS 2204. If you are interested in data science, a broader experience in Python in desirable (in fact, we recommend that students having taken CS 1100 try to take CS 2204 as well). CS/DS 1100 and 1104 have significant overlap (both are introductions to programming using Python). That said, it is permissible to take CS/DS 1100 after having taken CS 1104. You will only get 1 (out of 3) credit hours for CS/DS 1100 (after having taken CS 1104), but the combination of CS/DS 1100 and 1104 will satisfy the DS minor programming requirement. Note that if you enroll in three 3-hour courses and CS/DS 1100 (after having taken CS 1104) it will look like you are registered for 12 credit hours during registration and at the start of the semester, but your credit hours will be reduced to only 10 credit hours (because the credits for CS/DS 1100 will be cut back to 1 after the add/drop period). Enrolling in fewer than 12 credit hours can have significant consequences on financial aid and potentially on visa status for international students. Please be mindful of this. You have two options. You can either take CS 2201 (in C++) or take CS 1100 (in Python). Of course, you could also take CS 1104 and 2204 (in Python). CS 1100, 2201, and 2204 all satisfy the programming requirement for the minor. Note that CS 2201 is a prerequisite for many upper-level CS courses (as well as required for the CS major and minor). For more information, see What Programming Course To Take? No. They are listed under both because a student who takes one of the other machine learning courses to satisfy the core requirement (CS/DS 3262 or CS 4262) can also take ECON 3750 or MATH 3670 as an elective; the content is sufficiently different that both can count towards the minor, but one course cannot double-count for two minor requirements. Yes (see above). ECON 3750 and MATH 3670 are sufficiently different from CS 3262 or CS 4262 (and from each other) that you can take these as electives. In fact, you could take ECON 3750 to satisfy the machine learning requirement and then take MATH 3670 as an elective. CS 3262 can count towards the Data Science minor. CS 3262 does not count directly towards the Computer Science major requirements but could be used as either a tech elective or open elective for Computer Science majors. It does, as a prerequisite to MATH 2821, which counts towards the minor. 
The two-course sequence of MATH 2820 and MATH 2821 counts towards the Data Science Minor; the two-course sequence is required because MATH 2820 goes deep into mathematical foundations of probability ad statistics concepts, but does not by itself cover the breadth of topics of other introductory statistics courses. This two-course sequence provides an excellent introduction to mathematical statistics.'"
714
+ ]
715
+ },
716
+ "execution_count": null,
717
+ "metadata": {},
718
+ "output_type": "execute_result"
719
+ }
720
+ ],
721
+ "source": [
722
+ "# Try a different URL to see if behavior improves\n",
723
+ "justext_text = website_justext(demo_urls[1])\n",
724
+ "justext_text"
725
+ ]
726
+ },
727
+ {
728
+ "cell_type": "markdown",
729
+ "metadata": {},
730
+ "source": [
731
+ "Here, we see that we may prefer to stick with the langchain implementations. The first jusText example returned an empty string, although previous work demonstrates that on a different day, it worked well (note that the ESPN's content was different). With the second URL, parts of the website, particularly the headers, is actually missing."
732
+ ]
733
+ },
734
+ {
735
+ "cell_type": "markdown",
736
+ "metadata": {},
737
+ "source": [
738
+ "## Creating Document Segments\n",
739
+ "Now, the precursor to creating vector stores/embeddings is to create document segments. Since we have a variety of sources, we will keep this in mind as we develop the following function.\n",
740
+ "\n",
741
+ ":::{.callout-warning}\n",
742
+ "Note that the `get_document_segments` currently is meant to be used in one single pass with `context_info` being all of a single file type. [Issue #150](https://github.com/vanderbilt-data-science/lo-achievement/issues/150) is meant to expand this functionality so that if many files are uploaded, the software will be able to handle this.\n",
743
+ ":::"
744
+ ]
745
+ },
746
+ {
747
+ "cell_type": "code",
748
+ "execution_count": null,
749
+ "metadata": {},
750
+ "outputs": [],
751
+ "source": [
752
+ "#| export\n",
753
+ "def get_document_segments(context_info, data_type, chunk_size = 1500, chunk_overlap=100):\n",
754
+ "\n",
755
+ " load_fcn = None\n",
756
+ " addtnl_params = {'chunk_size': chunk_size, 'chunk_overlap': chunk_overlap}\n",
757
+ "\n",
758
+ " # Define function use to do the loading\n",
759
+ " if data_type == 'text':\n",
760
+ " load_fcn = rawtext_to_doc_split\n",
761
+ " elif data_type == 'web_page':\n",
762
+ " load_fcn = website_to_text_unstructured\n",
763
+ " elif data_type == 'youtube_video':\n",
764
+ " load_fcn = youtube_to_text\n",
765
+ " else:\n",
766
+ " load_fcn = files_to_text\n",
767
+ " \n",
768
+ " # Get the document segments\n",
769
+ " doc_segments = load_fcn(context_info, **addtnl_params)\n",
770
+ "\n",
771
+ " return doc_segments"
772
+ ]
773
+ },
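+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick sanity check of the dispatch logic, we can run the function on the web URLs from earlier (a minimal sketch; `demo_urls` is defined above, and `data_type='web_page'` routes to `website_to_text_unstructured`):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# sanity check the dispatcher on web pages\n",
+ "web_segments = get_document_segments(demo_urls, data_type='web_page', chunk_size=1500, chunk_overlap=100)\n",
+ "len(web_segments)"
+ ]
+ },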
774
+ {
775
+ "cell_type": "markdown",
776
+ "metadata": {},
777
+ "source": [
778
+ "## Creating Vector Stores from Document Segments\n",
779
+ "The last step here will be in the creation of vector stores from the provided document segments. We will allow for the usage of either Chroma or DeepLake and enforce OpenAIEmbeddings."
780
+ ]
781
+ },
782
+ {
783
+ "cell_type": "code",
784
+ "execution_count": null,
785
+ "metadata": {},
786
+ "outputs": [],
787
+ "source": [
788
+ "#| export\n",
789
+ "def create_local_vector_store(document_segments, **retriever_kwargs):\n",
790
+ " embeddings = OpenAIEmbeddings()\n",
791
+ " db = Chroma.from_documents(document_segments, embeddings)\n",
792
+ " retriever = db.as_retriever(**retriever_kwargs)\n",
793
+ " \n",
794
+ " return db, retriever"
795
+ ]
796
+ },
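+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The implementation above uses Chroma; a DeepLake-backed variant would follow the same pattern. Below is a minimal sketch (not exported), reusing `OpenAIEmbeddings` and assuming a local `dataset_path` of your choosing:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain.vectorstores import DeepLake\n",
+ "\n",
+ "def create_deeplake_vector_store(document_segments, dataset_path='./deeplake_ds', **retriever_kwargs):\n",
+ "    # Same pattern as the Chroma version, but persisted to a DeepLake dataset\n",
+ "    embeddings = OpenAIEmbeddings()\n",
+ "    db = DeepLake.from_documents(document_segments, embeddings, dataset_path=dataset_path)\n",
+ "    retriever = db.as_retriever(**retriever_kwargs)\n",
+ "\n",
+ "    return db, retriever"
+ ]
+ },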
797
+ {
798
+ "cell_type": "markdown",
799
+ "metadata": {},
800
+ "source": [
801
+ "### Unit test of vector store and segment creation"
802
+ ]
803
+ },
804
+ {
805
+ "cell_type": "code",
806
+ "execution_count": null,
807
+ "metadata": {},
808
+ "outputs": [],
809
+ "source": [
810
+ "from langchain.chat_models import ChatOpenAI\n",
811
+ "from getpass import getpass"
812
+ ]
813
+ },
814
+ {
815
+ "cell_type": "code",
816
+ "execution_count": null,
817
+ "metadata": {},
818
+ "outputs": [],
819
+ "source": [
820
+ "openai_api_key = getpass()\n",
821
+ "os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n",
822
+ "\n",
823
+ "llm = ChatOpenAI(model_name = 'gpt-3.5-turbo-16k')"
824
+ ]
825
+ },
826
+ {
827
+ "cell_type": "code",
828
+ "execution_count": null,
829
+ "metadata": {},
830
+ "outputs": [],
831
+ "source": [
832
+ "test_files = ['../roadnottaken.txt', '../2302.11382.pdf']\n",
833
+ "\n",
834
+ "#get vector store\n",
835
+ "segs = get_document_segments(test_files, data_type='other', chunk_size = 1000, chunk_overlap = 100)\n",
836
+ "chroma_db, vs_retriever = create_local_vector_store(segs)\n",
837
+ "\n",
838
+ "#create test retrievalqa\n",
839
+ "qa_chain = RetrievalQA.from_chain_type(llm=openai_llm, chain_type=\"stuff\", retriever=vs_retriever)"
840
+ ]
841
+ },
842
+ {
843
+ "cell_type": "code",
844
+ "execution_count": null,
845
+ "metadata": {},
846
+ "outputs": [
847
+ {
848
+ "data": {
849
+ "text/plain": [
850
+ "[Document(page_content='Two roads diverged in a yellow wood,\\rAnd sorry I could not travel both\\rAnd be one traveler, long I stood\\rAnd looked down one as far as I could\\rTo where it bent in the undergrowth;\\r\\rThen took the other, as just as fair,\\rAnd having perhaps the better claim,\\rBecause it was grassy and wanted wear;\\rThough as for that the passing there\\rHad worn them really about the same,\\r\\rAnd both that morning equally lay\\rIn leaves no step had trodden black. Oh, I kept the first for another day! Yet knowing how way leads on to way,\\rI doubted if I should ever come back. I shall be telling this with a sigh\\rSomewhere ages and ages hence:\\rTwo roads diverged in a wood, and IэI took the one less traveled by,\\rAnd that has made all the difference.', metadata={'source': '../roadnottaken.txt', 'start_index': 0}),\n",
851
+ " Document(page_content='any unnecessary steps,” is useful in flagging inaccuracies in the user’s original request so that the final recipe is efficient.', metadata={'source': '../2302.11382.pdf', 'start_index': 92662}),\n",
852
+ " Document(page_content='The third statement provides an optional way for the user to stop the output generation process. This step is not always needed, but can be useful in situations where there may be the potential for ambiguity regarding whether or not the user- provided input between inputs is meant as a refinement for the next generation or a command to stop. For example, an explicit stop phrase could be created if the user was generating data related to road signs, where the user might want to enter a refinement of the generation like “stop” to indicate that a stop sign should be added to the output.', metadata={'source': '../2302.11382.pdf', 'start_index': 72043}),\n",
853
+ " Document(page_content='“When I ask you a question, generate three addi- tional questions that would help you give a more accurate answer. Assume that I know little about the topic that we are discussing and please define any terms that are not general knowledge. When I have answered the three questions, combine the answers to produce the final answers to my original question.”\\n\\nOne point of variation in this pattern is where the facts are output. Given that the facts may be terms that the user is not familiar with, it is preferable if the list of facts comes after the output. This after-output presentation ordering allows the user to read and understand the statements before seeing what statements should be checked. The user may also determine additional facts prior to realizing the fact list at the end should be checked.', metadata={'source': '../2302.11382.pdf', 'start_index': 57473})]"
854
+ ]
855
+ },
856
+ "execution_count": null,
857
+ "metadata": {},
858
+ "output_type": "execute_result"
859
+ }
860
+ ],
861
+ "source": [
862
+ "# check for functionality\n",
863
+ "chroma_db.similarity_search('The street was forked and I did not know which way to go')"
864
+ ]
865
+ },
866
+ {
867
+ "cell_type": "code",
868
+ "execution_count": null,
869
+ "metadata": {},
870
+ "outputs": [],
871
+ "source": [
872
+ "#check qa chain for functionality\n",
873
+ "ans = qa_chain({'question':'What is the best prompt to use when I want the model to take on a certain attitude of a person?'})"
874
+ ]
875
+ },
876
+ {
877
+ "cell_type": "code",
878
+ "execution_count": null,
879
+ "metadata": {},
880
+ "outputs": [
881
+ {
882
+ "data": {
883
+ "text/plain": [
884
+ "{'question': 'What is the best prompt to use when I want the model to take on a certain attitude of a person?',\n",
885
+ " 'answer': 'The best prompt to use when you want the model to take on a certain attitude of a person is to provide a persona for the model to embody. This can be expressed as a job description, title, fictional character, historical figure, or any other attributes associated with a well-known type of person. The prompt should specify the outputs that this persona would create. Additionally, personas can also represent inanimate or non-human entities, such as a Linux terminal or a database. In this case, the prompt should specify how the inputs should be delivered to the entity and what outputs the entity should produce. It is also possible to provide a better version of the question and prompt the model to ask if the user would like to use the better version instead.\\n',\n",
886
+ " 'sources': '../2302.11382.pdf',\n",
887
+ " 'source_documents': [Document(page_content='4) Example Implementation: A sample prompt for a flipped\\n\\ninteraction is shown below:\\n\\n“From now on, I would like you to ask me questions to deploy a Python application to AWS. When you have enough information to deploy the application, create a Python script to automate the deployment.”\\n\\n2) Motivation: Users may not know what types of outputs or details are important for an LLM to focus on to achieve a given task. They may know, however, the role or type of person that they would normally ask to get help with these things. The Persona pattern enables the users to express what they need help with without knowing the exact details of the outputs they need.', metadata={'source': '../2302.11382.pdf', 'start_index': 36397}),\n",
888
+ " Document(page_content='ments:\\n\\nContextual Statements Act as persona X Provide outputs that persona X would create\\n\\nThe first statement conveys the idea that the LLM needs to act as a specific persona and provide outputs that such a persona would. This persona can be expressed in a number of ways, ranging from a job description, title, fictional char- acter, historical figure, etc. The persona should elicit a set of attributes associated with a well-known job title, type of person, etc.2\\n\\n5) Consequences: One consideration when designing the prompt is how much to dictate to the LLM regarding what information to collect prior to termination. In the example above, the flipped interaction is open-ended and can vary sig- nificantly in the final generated artifact. This open-endedness makes the prompt generic and reusable, but may potentially ask additional questions that could be skipped if more context is given.', metadata={'source': '../2302.11382.pdf', 'start_index': 37872}),\n",
889
+ " Document(page_content='In this example, the LLM is instructed to provide outputs that a ”security reviewer” would. The prompt further sets the stage that code is going to be evaluated. Finally, the user refines the persona by scoping the persona further to outputs regarding the code.\\n\\nPersonas can also represent inanimate or non-human en- tities, such as a Linux terminal, a database, or an animal’s perspective. When using this pattern to represent these entities, it can be useful to also specify how you want the inputs delivered to the entity, such as “assume my input is what the owner is saying to the dog and your output is the sounds the dog is making”. An example prompt for a non-human entity that uses a “pretend to be” wording is shown below:\\n\\n“You are going to pretend to be a Linux terminal for a computer that has been compromised by an attacker. When I type in a command, you are going the Linux to output terminal would produce.”\\n\\nthe corresponding text\\n\\nthat', metadata={'source': '../2302.11382.pdf', 'start_index': 41330}),\n",
890
+ " Document(page_content='the corresponding text\\n\\nthat\\n\\nThis prompt is designed to simulate a computer that has been compromised by an attacker and is being controlled through a Linux terminal. The prompt specifies that the user will input commands into the terminal, and in response, the simulated terminal will output the corresponding text that would be produced by a real Linux terminal. This prompt is more prescriptive in the persona and asks the LLM to, not only be a Linux terminal, but to further act as a computer that has been compromised by an attacker.\\n\\n3) Structure and Key Ideas: Fundamental contextual state-\\n\\nments:\\n\\nContextual Statements Within scope X, suggest a better version of the question to use instead (Optional) prompt me if I would like to use the better version instead', metadata={'source': '../2302.11382.pdf', 'start_index': 42256})]}"
891
+ ]
892
+ },
893
+ "execution_count": null,
894
+ "metadata": {},
895
+ "output_type": "execute_result"
896
+ }
897
+ ],
898
+ "source": [
899
+ "#show result\n",
900
+ "ans"
901
+ ]
902
+ },
903
+ {
904
+ "cell_type": "markdown",
905
+ "metadata": {},
906
+ "source": [
907
+ "In conclusion, this is looking pretty solid. Let's leverage this functionality within the code base."
908
+ ]
909
+ }
910
+ ],
911
+ "metadata": {
912
+ "kernelspec": {
913
+ "display_name": "python3",
914
+ "language": "python",
915
+ "name": "python3"
916
+ }
917
+ },
918
+ "nbformat": 4,
919
+ "nbformat_minor": 2
920
+ }
nbs/nbdev.yml ADDED
@@ -0,0 +1,9 @@
1
+ project:
2
+ output-dir: _docs
3
+
4
+ website:
5
+ title: "ai_classroom_suite"
6
+ site-url: "https://vanderbilt-data-science.github.io/lo-achievement"
7
+ description: "A repository supporting enhanced instruction and grading using AI"
8
+ repo-branch: main
9
+ repo-url: "https://github.com/vanderbilt-data-science/lo-achievement"
nbs/prompt_interaction_base.ipynb ADDED
@@ -0,0 +1,482 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# prompt_interaction_base.ipynb\n",
8
+ "> A notebook for formulating prompts and prompting\n",
9
+ "\n",
10
+ "In this notebook, we create some base functionality for creating prompts and getting answers for the LLMs in a simplified, unified way.\n",
11
+ "\n",
12
+ ":::{.callout-caution}\n",
13
+ "These notebooks are development notebooks, meaning that they are meant to be run locally or somewhere that supports navigating a full repository (in other words, not Google Colab unless you clone the entire repository to Drive and then mount the Drive repository). However, it is expected that if you're able to do all of those steps, you're likely also able to figure out the required pip installs for development there.\n",
14
+ ":::\n"
15
+ ]
16
+ },
17
+ {
18
+ "cell_type": "raw",
19
+ "metadata": {},
20
+ "source": [
21
+ "---\n",
22
+ "skip_exec: true\n",
23
+ "---"
24
+ ]
25
+ },
26
+ {
27
+ "cell_type": "code",
28
+ "execution_count": null,
29
+ "metadata": {},
30
+ "outputs": [],
31
+ "source": [
32
+ "#| default_exp PromptInteractionBase"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "code",
37
+ "execution_count": null,
38
+ "metadata": {},
39
+ "outputs": [],
40
+ "source": [
41
+ "#| export\n",
42
+ "from langchain.chat_models import ChatOpenAI\n",
43
+ "from langchain.llms import OpenAI\n",
44
+ "\n",
45
+ "from langchain import PromptTemplate\n",
46
+ "from langchain.prompts import ChatPromptTemplate\n",
47
+ "from langchain.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate\n",
48
+ "from langchain.chains import LLMChain, ConversationalRetrievalChain, RetrievalQAWithSourcesChain\n",
49
+ "from langchain.chains.base import Chain\n",
50
+ "\n",
51
+ "from getpass import getpass\n",
52
+ "\n",
53
+ "import os"
54
+ ]
55
+ },
56
+ {
57
+ "cell_type": "markdown",
58
+ "metadata": {},
59
+ "source": [
60
+ "## Model and Authentication Setup\n",
61
+ "Here, we create functionality to authenticate the user when needed, specifically for OpenAI models. Additionally, we create the capacity to make LLMChains and other chains through one unified interface."
62
+ ]
63
+ },
64
+ {
65
+ "cell_type": "code",
66
+ "execution_count": null,
67
+ "metadata": {},
68
+ "outputs": [],
69
+ "source": [
70
+ "#| export\n",
71
+ "def create_model(openai_mdl='gpt-3.5-turbo-16k', temperature=0.1, **chatopenai_kwargs):\n",
72
+ " llm = ChatOpenAI(model_name = openai_mdl, temperature=temperature, **chatopenai_kwargs)\n",
73
+ "\n",
74
+ " return llm"
75
+ ]
76
+ },
77
+ {
78
+ "cell_type": "code",
79
+ "execution_count": null,
80
+ "metadata": {},
81
+ "outputs": [],
82
+ "source": [
83
+ "#| export\n",
84
+ "def set_openai_key():\n",
85
+ " openai_api_key = getpass()\n",
86
+ " os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n",
87
+ "\n",
88
+ " return"
89
+ ]
90
+ },
91
+ {
92
+ "cell_type": "markdown",
93
+ "metadata": {},
94
+ "source": [
95
+ "**And now for a quick test of this functionality**"
96
+ ]
97
+ },
98
+ {
99
+ "cell_type": "code",
100
+ "execution_count": null,
101
+ "metadata": {},
102
+ "outputs": [],
103
+ "source": [
104
+ "set_openai_key()\n",
105
+ "assert os.environ[\"OPENAI_API_KEY\"], \"Either you didn't run set_openai_key or you haven't set it to something.\"\n",
106
+ "\n",
107
+ "chat_mdl = create_model()\n",
108
+ "assert isinstance(chat_mdl, ChatOpenAI), \"The default model type is currently ChatOpenAI. If that has changed, change this test.\""
109
+ ]
110
+ },
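+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Since `create_model` forwards its keyword arguments to `ChatOpenAI`, alternate settings can be sketched as below; the model name and temperature shown are illustrative choices, not required values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# sketch: a more creative tutor model (illustrative settings)\n",
+ "creative_mdl = create_model(openai_mdl='gpt-3.5-turbo', temperature=0.9)\n",
+ "assert isinstance(creative_mdl, ChatOpenAI)"
+ ]
+ },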
111
+ {
112
+ "cell_type": "markdown",
113
+ "metadata": {},
114
+ "source": [
115
+ "## Create chat prompt templates\n",
116
+ "Here, we'll create a tutor prompt template to help us with self-study and quizzing, and help create the student messages."
117
+ ]
118
+ },
119
+ {
120
+ "cell_type": "code",
121
+ "execution_count": null,
122
+ "metadata": {},
123
+ "outputs": [],
124
+ "source": [
125
+ "#| export\n",
126
+ "# Create system prompt template\n",
127
+ "SYSTEM_TUTOR_TEMPLATE = (\"You are a world-class tutor helping students to perform better on oral and written exams through interactive experiences. \" +\n",
128
+ " \"When assessing and evaluating students, you always ask one question at a time, and wait for the student's response before \" +\n",
129
+ " \"providing them with feedback. Asking one question at a time, waiting for the student's response, and then commenting \" +\n",
130
+ " \"on the strengths and weaknesses of their responses (when appropriate) is what makes you such a sought-after, world-class tutor.\")\n",
131
+ "\n",
132
+ "# Create a human response template\n",
133
+ "HUMAN_RESPONSE_TEMPLATE = (\"I'm trying to better understand the text provided below. {assessment_request} The learning objectives to be assessed are: \" +\n",
134
+ " \"{learning_objectives}. Although I may request more than one assessment question, you should \" +\n",
135
+ "                           \"only provide ONE question in your initial response. Do not include the answer in your response. \" +\n",
136
+ " \"If I get an answer wrong, provide me with an explanation of why it was incorrect, and then give me additional \" +\n",
137
+ " \"chances to respond until I get the correct choice. Explain why the correct choice is right. \" +\n",
138
+ " \"The text that you will base your questions on is as follows: {context}.\")\n",
139
+ "\n",
140
+ "HUMAN_RETRIEVER_RESPONSE_TEMPLATE = (\"I want to master the topics based on the excerpts of the text below. Given the following extracted text from long documents, {assessment_request} The learning objectives to be assessed are: \" +\n",
141
+ " \"{learning_objectives}. Although I may request more than one assessment question, you should \" +\n",
142
+ "                                     \"only provide ONE question in your initial response. Do not include the answer in your response. \" +\n",
143
+ " \"If I get an answer wrong, provide me with an explanation of why it was incorrect, and then give me additional \" +\n",
144
+ " \"chances to respond until I get the correct choice. Explain why the correct choice is right. \" +\n",
145
+ " \"The extracted text from long documents are as follows: {summaries}.\")\n",
146
+ "\n",
147
+ "def create_base_tutoring_prompt(system_prompt=None, human_prompt=None):\n",
148
+ "\n",
149
+ " #setup defaults using defined values\n",
150
+ "    if system_prompt is None:\n",
151
+ " system_prompt = PromptTemplate(template = SYSTEM_TUTOR_TEMPLATE,\n",
152
+ " input_variables = [])\n",
153
+ " \n",
154
+ "    if human_prompt is None:\n",
155
+ " human_prompt = PromptTemplate(template = HUMAN_RESPONSE_TEMPLATE,\n",
156
+ " input_variables=['assessment_request', 'learning_objectives', 'context'])\n",
157
+ "\n",
158
+ " # Create prompt messages\n",
159
+ " system_tutor_msg = SystemMessagePromptTemplate(prompt=system_prompt)\n",
160
+ " human_tutor_msg = HumanMessagePromptTemplate(prompt= human_prompt)\n",
161
+ "\n",
162
+ " # Create ChatPromptTemplate\n",
163
+ " chat_prompt = ChatPromptTemplate.from_messages([system_tutor_msg, human_tutor_msg])\n",
164
+ "\n",
165
+ " return chat_prompt"
166
+ ]
167
+ },
168
+ {
169
+ "cell_type": "markdown",
170
+ "metadata": {},
171
+ "source": [
172
+ "Now for a quick unit test..."
173
+ ]
174
+ },
175
+ {
176
+ "cell_type": "code",
177
+ "execution_count": null,
178
+ "metadata": {},
179
+ "outputs": [],
180
+ "source": [
181
+ "chat_prompt = create_base_tutoring_prompt()\n",
182
+ "assert chat_prompt.messages[0].prompt.template == SYSTEM_TUTOR_TEMPLATE, \"Did not set up the first chat_prompt to be SystemMessage\"\n",
183
+ "assert chat_prompt.messages[1].prompt.template == HUMAN_RESPONSE_TEMPLATE, \"Did not set up the second element of chat_prompt to be HumanMessage\""
184
+ ]
185
+ },
186
+ {
187
+ "cell_type": "markdown",
188
+ "metadata": {},
189
+ "source": [
190
+ "Now, let's define a function that allows us to set up default variables in case the user chooses not to pass something in."
191
+ ]
192
+ },
193
+ {
194
+ "cell_type": "code",
195
+ "execution_count": null,
196
+ "metadata": {},
197
+ "outputs": [],
198
+ "source": [
199
+ "#| export\n",
200
+ "DEFAULT_ASSESSMENT_MSG = 'Please design a 5 question short answer quiz about the provided text.'\n",
201
+ "DEFAULT_LEARNING_OBJS_MSG = 'Identify and comprehend the important topics and underlying messages and connections within the text'\n",
202
+ "\n",
203
+ "def get_tutoring_prompt(context, chat_template=None, assessment_request = None, learning_objectives = None, **kwargs):\n",
204
+ "\n",
205
+ " # set defaults\n",
206
+ " if chat_template is None:\n",
207
+ " chat_template = create_base_tutoring_prompt()\n",
208
+ " else:\n",
209
+ " if not all([prompt_var in chat_template.input_variables\n",
210
+ " for prompt_var in ['context', 'assessment_request', 'learning_objectives']]):\n",
211
+ "            raise KeyError('''It looks like you may have a custom chat_template. Either include context, assessment_request, and learning_objectives\n",
212
+ " as input variables or create your own tutoring prompt.''')\n",
213
+ "\n",
214
+ "    if assessment_request is None:\n",
215
+ " assessment_request = DEFAULT_ASSESSMENT_MSG\n",
216
+ " \n",
217
+ " if learning_objectives is None:\n",
218
+ " learning_objectives = DEFAULT_LEARNING_OBJS_MSG\n",
219
+ " \n",
220
+ " # compose final prompt\n",
221
+ " tutoring_prompt = chat_template.format_prompt(context=context,\n",
222
+ " assessment_request = assessment_request,\n",
223
+ " learning_objectives = learning_objectives,\n",
224
+ " **kwargs)\n",
225
+ " \n",
226
+ " return tutoring_prompt\n"
227
+ ]
228
+ },
229
+ {
230
+ "cell_type": "markdown",
231
+ "metadata": {},
232
+ "source": [
233
+ "**Another quick unit test...**"
234
+ ]
235
+ },
236
+ {
237
+ "cell_type": "code",
238
+ "execution_count": null,
239
+ "metadata": {},
240
+ "outputs": [
241
+ {
242
+ "data": {
243
+ "text/plain": [
244
+ "[SystemMessage(content=\"You are a world-class tutor helping students to perform better on oral and written exams through interactive experiences.\\nWhen assessing and evaluating students, you always ask one question at a time, and wait for the student's response before providing them with feedback.\\nAsking one question at a time, waiting for the student's response, and then commenting on the strengths and weaknesses of their responses (when appropriate)\\nis what makes you such a sought-after, world-class tutor.\", additional_kwargs={}),\n",
245
+ " HumanMessage(content=\"I'm trying to better understand the text provided below. Please design a 5 question short answer quiz about the provided text. The learning objectives to be assessed are:\\nIdentify and comprehend the important topics and underlying messages and connections within the text. Although I may request more than one assessment question, you should\\nonly provide ONE question in your initial response. Do not include the answer in your response.\\nIf I get an answer wrong, provide me with an explanation of why it was incorrect, and then give me additional\\nchances to respond until I get the correct choice. Explain why the correct choice is right.\\nThe text that you will base your questions on is as follows: The dog was super pretty and cute.\", additional_kwargs={}, example=False)]"
246
+ ]
247
+ },
248
+ "execution_count": null,
249
+ "metadata": {},
250
+ "output_type": "execute_result"
251
+ }
252
+ ],
253
+ "source": [
254
+ "# For defaults\n",
255
+ "res = get_tutoring_prompt('The dog was super pretty and cute').to_messages()\n",
256
+ "res"
257
+ ]
258
+ },
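+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The defaults can also be overridden. Below is a minimal sketch of passing a custom assessment request and learning objectives; the particular strings are illustrative only."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# For custom requests and objectives (illustrative values)\n",
+ "res = get_tutoring_prompt('The dog was super pretty and cute',\n",
+ "                          assessment_request='Please design a 3 question multiple choice quiz about the provided text.',\n",
+ "                          learning_objectives='Recall descriptive details from the text.').to_messages()\n",
+ "res"
+ ]
+ },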
259
+ {
260
+ "cell_type": "markdown",
261
+ "metadata": {},
262
+ "source": [
263
+ "Now, let's finally define how we can get the chat response from the model."
264
+ ]
265
+ },
266
+ {
267
+ "cell_type": "code",
268
+ "execution_count": null,
269
+ "metadata": {},
270
+ "outputs": [],
271
+ "source": [
272
+ "#| export\n",
273
+ "def get_tutoring_answer(context, tutor_mdl, chat_template=None, assessment_request=None, learning_objectives=None, return_dict=False, call_kwargs={}, input_kwargs={}):\n",
274
+ " \n",
275
+ " # Get answer from chat\n",
276
+ " \n",
277
+ " # set defaults\n",
278
+ " if assessment_request is None:\n",
279
+ " assessment_request = DEFAULT_ASSESSMENT_MSG\n",
280
+ " if learning_objectives is None:\n",
281
+ " learning_objectives = DEFAULT_LEARNING_OBJS_MSG\n",
282
+ " \n",
283
+ " common_inputs = {'assessment_request':assessment_request, 'learning_objectives':learning_objectives}\n",
284
+ " \n",
285
+ " # get answer based on interaction type\n",
286
+ " if isinstance(tutor_mdl, ChatOpenAI):\n",
287
+ " human_ask_prompt = get_tutoring_prompt(context, chat_template, assessment_request, learning_objectives)\n",
288
+ "        tutor_answer = tutor_mdl(human_ask_prompt.to_messages())\n",
+ "        final_answer = tutor_answer  # keep the full message object in case return_dict=True\n",
289
+ "\n",
290
+ " if not return_dict:\n",
291
+ " final_answer = tutor_answer.content\n",
292
+ " \n",
293
+ " elif isinstance(tutor_mdl, Chain):\n",
294
+ " if isinstance(tutor_mdl, RetrievalQAWithSourcesChain):\n",
295
+ " if 'question' not in input_kwargs.keys():\n",
296
+ " common_inputs['question'] = assessment_request\n",
297
+ " final_inputs = {**common_inputs, **input_kwargs}\n",
298
+ " else:\n",
299
+ " common_inputs['context'] = context\n",
300
+ " final_inputs = {**common_inputs, **input_kwargs}\n",
301
+ " \n",
302
+ " # get answer\n",
303
+ " tutor_answer = tutor_mdl(final_inputs, **call_kwargs)\n",
304
+ " final_answer = tutor_answer\n",
305
+ "\n",
306
+ "        # retrieval chains expose an 'answer' key; other chains (e.g., LLMChain) keep their native output dictionary\n",
+ "        if not return_dict and 'answer' in final_answer:\n",
307
+ " final_answer = final_answer['answer']\n",
308
+ " \n",
309
+ " else:\n",
310
+ " raise NotImplementedError(f\"tutor_mdl of type {type(tutor_mdl)} is not supported.\")\n",
311
+ "\n",
312
+ " return final_answer"
313
+ ]
314
+ },
315
+ {
316
+ "cell_type": "code",
317
+ "execution_count": null,
318
+ "metadata": {},
319
+ "outputs": [],
320
+ "source": [
321
+ "#| export\n",
322
+ "\n",
323
+ "DEFAULT_CONDENSE_PROMPT_TEMPLATE = (\"Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, \" + \n",
324
+ " \"in its original language.\\n\\nChat History:\\n{chat_history}\\nFollow Up Input: {question}\\nStandalone question:\")\n",
325
+ "\n",
326
+ "DEFAULT_QUESTION_PROMPT_TEMPLATE = (\"Use the following portion of a long document to see if any of the text is relevant to creating a response to the question.\" +\n",
327
+ " \"\\nReturn any relevant text verbatim.\\n{context}\\nQuestion: {question}\\nRelevant text, if any:\")\n",
328
+ "\n",
329
+ "DEFAULT_COMBINE_PROMPT_TEMPLATE = (\"Given the following extracted parts of a long document and the given prompt, create a final answer with references ('SOURCES'). \"+\n",
330
+ " \"If you don't have a response, just say that you are unable to come up with a response. \"+\n",
331
+ "                                   \"\\nSOURCES:\\n\\nQUESTION: {question}\\n=========\\n{summaries}\\n=========\\nFINAL ANSWER:\")\n",
332
+ "\n",
333
+ "def create_tutor_mdl_chain(kind='llm', mdl=None, prompt_template = None, **kwargs):\n",
334
+ " \n",
335
+ " #Validate parameters\n",
336
+ " if mdl is None:\n",
337
+ " mdl = create_model()\n",
338
+ " kind = kind.lower()\n",
339
+ " \n",
340
+ " #Create model chain\n",
341
+ " if kind == 'llm':\n",
342
+ " if prompt_template is None:\n",
343
+ " prompt_template = create_base_tutoring_prompt()\n",
344
+ " mdl_chain = LLMChain(llm=mdl, prompt=prompt_template, **kwargs)\n",
345
+ " elif kind == 'conversational':\n",
346
+ " if prompt_template is None:\n",
347
+ " prompt_template = PromptTemplate.from_template(DEFAULT_CONDENSE_PROMPT_TEMPLATE)\n",
348
+ "        mdl_chain = ConversationalRetrievalChain.from_llm(mdl, condense_question_prompt = prompt_template, **kwargs)\n",
349
+ " elif kind == 'retrieval_qa':\n",
350
+ " if prompt_template is None:\n",
351
+ "\n",
352
+ " #Create custom human prompt to take in summaries\n",
353
+ " human_prompt = PromptTemplate(template = HUMAN_RETRIEVER_RESPONSE_TEMPLATE,\n",
354
+ " input_variables=['assessment_request', 'learning_objectives', 'summaries'])\n",
355
+ " prompt_template = create_base_tutoring_prompt(human_prompt=human_prompt)\n",
356
+ " \n",
357
+ " #Create the combination prompt and model\n",
358
+ " question_template = PromptTemplate.from_template(DEFAULT_QUESTION_PROMPT_TEMPLATE)\n",
359
+ " mdl_chain = RetrievalQAWithSourcesChain.from_llm(llm=mdl, question_prompt=question_template, combine_prompt = prompt_template, **kwargs)\n",
360
+ " else:\n",
361
+ " raise NotImplementedError(f\"Model kind {kind} not implemented\")\n",
362
+ " \n",
363
+ " return mdl_chain"
364
+ ]
365
+ },
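+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a minimal sketch of the `retrieval_qa` kind, the cell below wires the chain to a small in-memory vector store. This assumes the `chromadb` package is available and an OpenAI key has been set; the toy text is illustrative only, and the cell is a sketch rather than part of the validated tests below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: build a tiny retriever, then create and query a retrieval_qa tutor chain (assumes chromadb installed)\n",
+ "from langchain.embeddings import OpenAIEmbeddings\n",
+ "from langchain.vectorstores import Chroma\n",
+ "\n",
+ "toy_db = Chroma.from_texts(['The dog was super pretty and cute.'], OpenAIEmbeddings())\n",
+ "retrieval_chain = create_tutor_mdl_chain(kind='retrieval_qa', retriever=toy_db.as_retriever())\n",
+ "\n",
+ "# get_tutoring_answer routes Chain instances through their input dictionaries; context is unused here\n",
+ "res = get_tutoring_answer(None, retrieval_chain)\n",
+ "print(res)"
+ ]
+ },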
366
+ {
367
+ "cell_type": "markdown",
368
+ "metadata": {},
369
+ "source": [
370
+ "**Another brief test of the behavior of these functions**"
371
+ ]
372
+ },
373
+ {
374
+ "cell_type": "code",
375
+ "execution_count": null,
376
+ "metadata": {},
377
+ "outputs": [],
378
+ "source": [
379
+ "res = get_tutoring_answer('The dog is super cute', chat_mdl)\n",
380
+ "print(res)"
381
+ ]
382
+ },
383
+ {
384
+ "cell_type": "markdown",
385
+ "metadata": {},
386
+ "source": [
387
+ "### Validate LLM Chain"
388
+ ]
389
+ },
390
+ {
391
+ "cell_type": "code",
392
+ "execution_count": null,
393
+ "metadata": {},
394
+ "outputs": [],
395
+ "source": [
396
+ "# Try llm model chain, making sure we've set the API key\n",
397
+ "llm_chain_test = create_tutor_mdl_chain('llm')\n",
398
+ "res = llm_chain_test.run({'context':'some context', 'assessment_request':'some assessment', 'learning_objectives':'some prompt'})"
399
+ ]
400
+ },
401
+ {
402
+ "cell_type": "code",
403
+ "execution_count": null,
404
+ "metadata": {},
405
+ "outputs": [
406
+ {
407
+ "name": "stdout",
408
+ "output_type": "stream",
409
+ "text": [
410
+ "<class 'langchain.chains.llm.LLMChain'>\n"
411
+ ]
412
+ },
413
+ {
414
+ "data": {
415
+ "text/plain": [
416
+ "'Sure, I can help you with that. Please provide me with the specific text that you would like me to base my questions on.'"
417
+ ]
418
+ },
419
+ "execution_count": null,
420
+ "metadata": {},
421
+ "output_type": "execute_result"
422
+ }
423
+ ],
424
+ "source": [
425
+ "# Verify information about the cell above\n",
426
+ "print(type(llm_chain_test))\n",
427
+ "print(res)\n",
428
+ "\n",
429
+ "# unit tests\n",
430
+ "assert isinstance(llm_chain_test, LLMChain), 'the output of llm create_tutor_mdl_chain should be an instance of LLMChain'\n",
431
+ "assert isinstance(res, str), 'the output of running the llm chain should be of type string.'"
432
+ ]
433
+ },
434
+ {
435
+ "cell_type": "markdown",
436
+ "metadata": {},
437
+ "source": [
438
+ "Now, we'll try this with just the default function to run things..."
439
+ ]
440
+ },
441
+ {
442
+ "cell_type": "code",
443
+ "execution_count": null,
444
+ "metadata": {},
445
+ "outputs": [
446
+ {
447
+ "data": {
448
+ "text/plain": [
449
+ "{'context': 'some context',\n",
450
+ " 'assessment_request': 'Please design a 5 question short answer quiz about the provided text.',\n",
451
+ " 'learning_objectives': 'Identify and comprehend the important topics and underlying messages and connections within the text',\n",
452
+ " 'text': 'Question 1: What are the main topics discussed in the text?\\n\\n(Note: Please provide your answer and I will provide feedback accordingly.)'}"
453
+ ]
454
+ },
455
+ "execution_count": null,
456
+ "metadata": {},
457
+ "output_type": "execute_result"
458
+ }
459
+ ],
460
+ "source": [
461
+ "res = get_tutoring_answer(context='some context', tutor_mdl = llm_chain_test)\n",
462
+ "res"
463
+ ]
464
+ },
465
+ {
466
+ "cell_type": "markdown",
467
+ "metadata": {},
468
+ "source": [
469
+ "OK, this base functionality is looking good."
470
+ ]
471
+ }
472
+ ],
473
+ "metadata": {
474
+ "kernelspec": {
475
+ "display_name": "python3",
476
+ "language": "python",
477
+ "name": "python3"
478
+ }
479
+ },
480
+ "nbformat": 4,
481
+ "nbformat_minor": 2
482
+ }
nbs/self_study_prompts.ipynb ADDED
@@ -0,0 +1,342 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# self_study_prompts.ipynb\n",
8
+ "> A listing of all prompts for self-study\n",
9
+ "\n",
10
+ "This notebook contains all prompts used for self-study, collected in a central place where they can be monitored and evaluated for appropriate functionality. Note that these serve as the request portion of the prompts.\n",
11
+ "\n",
12
+ ":::{.callout-caution}\n",
13
+ "These notebooks are development notebooks, meaning that they are meant to be run locally or somewhere that supports navigating a full repository (in other words, not Google Colab unless you clone the entire repository to Drive and then mount the Drive repository). However, it is expected that if you're able to do all of those steps, you're likely also able to figure out the required pip installs for development there.\n",
14
+ ":::"
15
+ ]
16
+ },
17
+ {
18
+ "cell_type": "raw",
19
+ "metadata": {},
20
+ "source": [
21
+ "---\n",
22
+ "skip_exec: true\n",
23
+ "---"
24
+ ]
25
+ },
26
+ {
27
+ "cell_type": "code",
28
+ "execution_count": null,
29
+ "metadata": {},
30
+ "outputs": [],
31
+ "source": [
32
+ "#| default_exp SelfStudyPrompts"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "markdown",
37
+ "metadata": {},
38
+ "source": [
39
+ "## Self-study texts\n",
40
+ "We'll now define the text for our self-study questions. Note that these will align with `assessment_request` in the `PromptInteractionBase` module."
41
+ ]
42
+ },
43
+ {
44
+ "cell_type": "code",
45
+ "execution_count": null,
46
+ "metadata": {},
47
+ "outputs": [],
48
+ "source": [
49
+ "#| export\n",
50
+ "# used for pretty display\n",
51
+ "import pandas as pd\n",
+ "from IPython.display import display"
52
+ ]
53
+ },
54
+ {
55
+ "cell_type": "code",
56
+ "execution_count": null,
57
+ "metadata": {},
58
+ "outputs": [],
59
+ "source": [
60
+ "#| export\n",
61
+ "MC_QUIZ_DEFAULT = \"Please design a 5 question multiple choice quiz about the provided text.\"\n",
62
+ "\n",
63
+ "SHORT_ANSWER_DEFAULT = (\"Please design a 5 question short answer quiz about the provided text. \"\n",
64
+ " \"The question types should be short answer. Expect the correct answers to be a few sentences long.\")\n",
65
+ "\n",
66
+ "FILL_BLANK_DEFAULT = \"\"\"Create a 5 question fill in the blank quiz referencing parts of the provided text.\n",
67
+ "The \"blank\" part of the question should appear as \"________\". The answers should reflect what word(s) should go in the blank to make an accurate statement.\n",
68
+ "An example is as follows: \"The author of the book is ______.\" The question should be a statement.\n",
69
+ "\"\"\"\n",
70
+ "\n",
71
+ "SEQUENCING_DEFAULT = \"\"\"Create a 5 question questionnaire that will ask me to recall the steps or sequence of events\n",
72
+ "in the provided text.\"\"\"\n",
73
+ "\n",
74
+ "RELATIONSHIP_DEFAULT = (\"Create a 5 question quiz for the student that asks the student to identify relationships between \"\n",
75
+ " \"topics or concepts that are important to understanding this text.\")\n",
76
+ "\n",
77
+ "CONCEPTS_DEFAULT = \"\"\"Design a 5 question quiz that asks me about definitions or concepts of importance in the provided text.\"\"\"\n",
78
+ "\n",
79
+ "REAL_WORLD_EXAMPLE_DEFAULT = \"\"\"Demonstrate how the provided context can be applied to solve a real world problem.\n",
80
+ "Ask me questions about how the demonstration you provided relates to solving a real world problem.\"\"\"\n",
81
+ "\n",
82
+ "RANDOMIZED_QUESTIONS_DEFAULT = \"\"\"Generate a high-quality assessment consisting of 5 varied questions,\n",
83
+ "each of different types (open-ended, multiple choice, short answer, analogies, etc.)\"\"\"\n",
84
+ "\n",
85
+ "SELF_STUDY_PROMPT_NAMES = ['MC_QUIZ_DEFAULT',\n",
86
+ "'SHORT_ANSWER_DEFAULT',\n",
87
+ "'FILL_BLANK_DEFAULT',\n",
88
+ "'SEQUENCING_DEFAULT',\n",
89
+ "'RELATIONSHIP_DEFAULT',\n",
90
+ "'CONCEPTS_DEFAULT',\n",
91
+ "'REAL_WORLD_EXAMPLE_DEFAULT',\n",
92
+ "'RANDOMIZED_QUESTIONS_DEFAULT']"
93
+ ]
94
+ },
95
+ {
96
+ "cell_type": "markdown",
97
+ "metadata": {},
98
+ "source": [
99
+ "## Create functions to assist with creating prompts\n",
100
+ "Now, we'll create some functions that allow the user to display all of the available prompts."
101
+ ]
102
+ },
103
+ {
104
+ "cell_type": "code",
105
+ "execution_count": null,
106
+ "metadata": {},
107
+ "outputs": [],
108
+ "source": [
109
+ "#| export\n",
110
+ "# Define self study dictionary for lookup\n",
111
+ "SELF_STUDY_DEFAULTS = {'mc': MC_QUIZ_DEFAULT,\n",
112
+ "'short_answer': SHORT_ANSWER_DEFAULT,\n",
113
+ "'fill_blank': FILL_BLANK_DEFAULT,\n",
114
+ "'sequencing': SEQUENCING_DEFAULT,\n",
115
+ "'relationships': RELATIONSHIP_DEFAULT,\n",
116
+ "'concepts': CONCEPTS_DEFAULT,\n",
117
+ "'real_world_example': REAL_WORLD_EXAMPLE_DEFAULT,\n",
118
+ "'randomized_questions': RANDOMIZED_QUESTIONS_DEFAULT\n",
119
+ "} \n",
120
+ "\n",
121
+ "# Return list of all self study prompts\n",
122
+ "def list_all_self_study_prompt_keys():\n",
123
+ " return list(SELF_STUDY_DEFAULTS.keys())\n",
124
+ "\n",
125
+ "def list_all_self_study_prompts():\n",
126
+ " return list(SELF_STUDY_DEFAULTS.values())\n",
127
+ " \n",
128
+ "# Return list of all self study variable names\n",
129
+ "def list_default_self_prompt_varnames():\n",
130
+ " return SELF_STUDY_PROMPT_NAMES\n",
131
+ "\n",
132
+ "# Print as a table\n",
133
+ "def print_all_self_study_prompts():\n",
134
+ " with pd.option_context('max_colwidth', None):\n",
135
+ " display(pd.DataFrame({'SELF_STUDY_DEFAULTS key': list(SELF_STUDY_DEFAULTS.keys()),\n",
136
+ " 'Prompt': list(SELF_STUDY_DEFAULTS.values())}))\n"
137
+ ]
138
+ },
139
+ {
140
+ "cell_type": "markdown",
141
+ "metadata": {},
142
+ "source": [
143
+ "Now, we'll have a quick unit test just to make sure this is working correctly."
144
+ ]
145
+ },
146
+ {
147
+ "cell_type": "code",
148
+ "execution_count": null,
149
+ "metadata": {},
150
+ "outputs": [
151
+ {
152
+ "data": {
153
+ "text/plain": [
154
+ "['mc',\n",
155
+ " 'short_answer',\n",
156
+ " 'fill_blank',\n",
157
+ " 'sequencing',\n",
158
+ " 'relationships',\n",
159
+ " 'concepts',\n",
160
+ " 'real_world_example',\n",
161
+ " 'randomized_questions']"
162
+ ]
163
+ },
164
+ "execution_count": null,
165
+ "metadata": {},
166
+ "output_type": "execute_result"
167
+ }
168
+ ],
169
+ "source": [
170
+ "list_all_self_study_prompt_keys()"
171
+ ]
172
+ },
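+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a lightweight sanity check, the sketch below asserts that the lookup dictionary, the exported name list, and the accessor functions stay in sync; it relies only on the definitions above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# sanity checks: the dict, the exported names, and the accessors should agree\n",
+ "assert len(SELF_STUDY_DEFAULTS) == len(SELF_STUDY_PROMPT_NAMES), 'every default prompt should have an exported name'\n",
+ "assert list_all_self_study_prompts() == [SELF_STUDY_DEFAULTS[k] for k in list_all_self_study_prompt_keys()], 'values should align with keys'"
+ ]
+ },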
173
+ {
174
+ "cell_type": "code",
175
+ "execution_count": null,
176
+ "metadata": {},
177
+ "outputs": [
178
+ {
179
+ "data": {
180
+ "text/plain": [
181
+ "['Please design a 5 question multiple choice quiz about the provided text.',\n",
182
+ " 'Please design a 5 question short answer quiz about the provided text. The question types should be short answer. Expect the correct answers to be a few sentences long.',\n",
183
+ " 'Create a 5 question fill in the blank quiz referencing parts of the provided text.\\nThe \"blank\" part of the question should appear as \"________\". The answers should reflect what word(s) should go in the blank to make an accurate statement.\\nAn example is as follows: \"The author of the book is ______.\" The question should be a statement.\\n',\n",
184
+ " 'Create a 5 question questionnaire that will ask me to recall the steps or sequence of events\\nin the provided text.',\n",
185
+ " 'Create a 5 question quiz for the student that asks the student to identify relationships between topics or concepts that are important to understanding this text.',\n",
186
+ " 'Design a 5 question quiz that asks me about definitions or concepts of importance in the provided text.',\n",
187
+ " 'Demonstrate how the provided context can be applied to solve a real world problem.\\nAsk me questions about how the demonstration you provided relates to solving a real world problem.',\n",
188
+ " 'Generate a high-quality assessment consisting of 5 varied questions,\\neach of different types (open-ended, multiple choice, short answer, analogies, etc.)']"
189
+ ]
190
+ },
191
+ "execution_count": null,
192
+ "metadata": {},
193
+ "output_type": "execute_result"
194
+ }
195
+ ],
196
+ "source": [
197
+ "list_all_self_study_prompts()"
198
+ ]
199
+ },
200
+ {
201
+ "cell_type": "code",
202
+ "execution_count": null,
203
+ "metadata": {},
204
+ "outputs": [
205
+ {
206
+ "data": {
207
+ "text/plain": [
208
+ "['MC_QUIZ_DEFAULT',\n",
209
+ " 'SHORT_ANSWER_DEFAULT',\n",
210
+ " 'FILL_BLANK_DEFAULT',\n",
211
+ " 'SEQUENCING_DEFAULT',\n",
212
+ " 'RELATIONSHIP_DEFAULT',\n",
213
+ " 'CONCEPTS_DEFAULT',\n",
214
+ " 'REAL_WORLD_EXAMPLE_DEFAULT',\n",
215
+ " 'RANDOMIZED_QUESTIONS_DEFAULT']"
216
+ ]
217
+ },
218
+ "execution_count": null,
219
+ "metadata": {},
220
+ "output_type": "execute_result"
221
+ }
222
+ ],
223
+ "source": [
224
+ "list_default_self_prompt_varnames()"
225
+ ]
226
+ },
227
+ {
228
+ "cell_type": "code",
229
+ "execution_count": null,
230
+ "metadata": {},
231
+ "outputs": [
232
+ {
233
+ "data": {
234
+ "text/html": [
235
+ "<div>\n",
236
+ "<style scoped>\n",
237
+ " .dataframe tbody tr th:only-of-type {\n",
238
+ " vertical-align: middle;\n",
239
+ " }\n",
240
+ "\n",
241
+ " .dataframe tbody tr th {\n",
242
+ " vertical-align: top;\n",
243
+ " }\n",
244
+ "\n",
245
+ " .dataframe thead th {\n",
246
+ " text-align: right;\n",
247
+ " }\n",
248
+ "</style>\n",
249
+ "<table border=\"1\" class=\"dataframe\">\n",
250
+ " <thead>\n",
251
+ " <tr style=\"text-align: right;\">\n",
252
+ " <th></th>\n",
253
+ "      <th>SELF_STUDY_DEFAULTS key</th>\n",
254
+ " <th>Prompt</th>\n",
255
+ " </tr>\n",
256
+ " </thead>\n",
257
+ " <tbody>\n",
258
+ " <tr>\n",
259
+ " <th>0</th>\n",
260
+ " <td>mc</td>\n",
261
+ "      <td>Please design a 5 question multiple choice quiz about the provided text.</td>\n",
262
+ " </tr>\n",
263
+ " <tr>\n",
264
+ " <th>1</th>\n",
265
+ "      <td>short_answer</td>\n",
266
+ " <td>Please design a 5 question short answer quiz about the provided text. The question types should be short answer. Expect the correct answers to be a few sentences long.</td>\n",
267
+ " </tr>\n",
268
+ " <tr>\n",
269
+ " <th>2</th>\n",
270
+ "      <td>fill_blank</td>\n",
271
+ "      <td>Create a 5 question fill in the blank quiz referencing parts of the provided text.\\nThe \"blank\" part of the question should appear as \"________\". The answers should reflect what word(s) should go in the blank to make an accurate statement.\\nAn example is as follows: \"The author of the book is ______.\" The question should be a statement.\\n</td>\n",
272
+ " </tr>\n",
273
+ " <tr>\n",
274
+ " <th>3</th>\n",
275
+ " <td>sequencing</td>\n",
276
+ " <td>Create a 5 question questionnaire that will ask me to recall the steps or sequence of events\\nin the provided text.</td>\n",
277
+ " </tr>\n",
278
+ " <tr>\n",
279
+ " <th>4</th>\n",
280
+ " <td>relationships</td>\n",
281
+ "      <td>Create a 5 question quiz for the student that asks the student to identify relationships between topics or concepts that are important to understanding this text.</td>\n",
282
+ " </tr>\n",
283
+ " <tr>\n",
284
+ " <th>5</th>\n",
285
+ " <td>concepts</td>\n",
286
+ " <td>Design a 5 question quiz that asks me about definitions or concepts of importance in the provided text.</td>\n",
287
+ " </tr>\n",
288
+ " <tr>\n",
289
+ " <th>6</th>\n",
290
+ " <td>real_world_example</td>\n",
291
+ " <td>Demonstrate how the provided context can be applied to solve a real world problem.\\nAsk me questions about how the demonstration you provided relates to solving a real world problem.</td>\n",
292
+ " </tr>\n",
293
+ " <tr>\n",
294
+ " <th>7</th>\n",
295
+ " <td>randomized_questions</td>\n",
296
+ " <td>Generate a high-quality assessment consisting of 5 varied questions,\\neach of different types (open-ended, multiple choice, short answer, analogies, etc.)</td>\n",
297
+ " </tr>\n",
298
+ " </tbody>\n",
299
+ "</table>\n",
300
+ "</div>"
301
+ ],
302
+ "text/plain": [
303
+ "  SELF_STUDY_DEFAULTS key  \\\n",
304
+ "0 mc \n",
305
+ "1           short_answer   \n",
306
+ "2             fill_blank   \n",
307
+ "3 sequencing \n",
308
+ "4 relationships \n",
309
+ "5 concepts \n",
310
+ "6 real_world_example \n",
311
+ "7 randomized_questions \n",
312
+ "\n",
313
+ " Prompt \n",
314
+ "0                            Please design a 5 question multiple choice quiz about the provided text.  \n",
315
+ "1 Please design a 5 question short answer quiz about the provided text. The question types should be short answer. Expect the correct answers to be a few sentences long. \n",
316
+ "2  Create a 5 question fill in the blank quiz referencing parts of the provided text.\\nThe \"blank\" part of the question should appear as \"________\". The answers should reflect what word(s) should go in the blank to make an accurate statement.\\nAn example is as follows: \"The author of the book is ______.\" The question should be a statement.\\n  \n",
317
+ "3 Create a 5 question questionnaire that will ask me to recall the steps or sequence of events\\nin the provided text. \n",
318
+ "4     Create a 5 question quiz for the student that asks the student to identify relationships between topics or concepts that are important to understanding this text.  \n",
319
+ "5 Design a 5 question quiz that asks me about definitions or concepts of importance in the provided text. \n",
320
+ "6 Demonstrate how the provided context can be applied to solve a real world problem.\\nAsk me questions about how the demonstration you provided relates to solving a real world problem. \n",
321
+ "7 Generate a high-quality assessment consisting of 5 varied questions,\\neach of different types (open-ended, multiple choice, short answer, analogies, etc.) "
322
+ ]
323
+ },
324
+ "metadata": {},
325
+ "output_type": "display_data"
326
+ }
327
+ ],
328
+ "source": [
329
+ "print_all_self_study_prompts()"
330
+ ]
331
+ }
332
+ ],
333
+ "metadata": {
334
+ "kernelspec": {
335
+ "display_name": "python3",
336
+ "language": "python",
337
+ "name": "python3"
338
+ }
339
+ },
340
+ "nbformat": 4,
341
+ "nbformat_minor": 2
342
+ }
nbs/styles.css ADDED
@@ -0,0 +1,37 @@
1
+ .cell {
2
+ margin-bottom: 1rem;
3
+ }
4
+
5
+ .cell > .sourceCode {
6
+ margin-bottom: 0;
7
+ }
8
+
9
+ .cell-output > pre {
10
+ margin-bottom: 0;
11
+ }
12
+
13
+ .cell-output > pre, .cell-output > .sourceCode > pre, .cell-output-stdout > pre {
14
+ margin-left: 0.8rem;
15
+ margin-top: 0;
16
+ background: none;
17
+ border-left: 2px solid lightsalmon;
18
+ border-top-left-radius: 0;
19
+ border-top-right-radius: 0;
20
+ }
21
+
22
+ .cell-output > .sourceCode {
+   border: none;
+   background: none;
+   margin-top: 0;
+ }
30
+
31
+ div.description {
32
+ padding-left: 2px;
33
+ padding-top: 5px;
34
+ font-style: italic;
35
+ font-size: 135%;
36
+ opacity: 70%;
37
+ }
prompt_with_context.ipynb ADDED
@@ -0,0 +1,796 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "<a href=\"https://colab.research.google.com/github/vanderbilt-data-science/lo-achievement/blob/main/prompt_with_context.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "markdown",
12
+ "metadata": {},
13
+ "source": [
14
+ "# LLMs for Self-Study\n",
15
+ "> A prompt and code template for better understanding texts\n",
16
+ "\n",
17
+ "This notebook provides a guide for using LLMs for self-study programmatically. A number of prompt templates are provided to assist with generating great assessments for self-study, and code is additionally provided for fast usage. This notebook is best leveraged with text **copied and pasted directly into it** as context for interaction with the model.\n",
18
+ "\n",
19
+ "This version of the notebook is best suited for those who prefer to copy and paste text directly into the notebook as context rather than loading files from their local drive. If you prefer to use uploaded files as context, you should direct yourself to the corresponding file-based notebook in the [lo-achievement repository](https://github.com/vanderbilt-data-science/lo-achievement)."
20
+ ]
21
+ },
22
+ {
23
+ "cell_type": "markdown",
24
+ "metadata": {},
25
+ "source": [
26
+ "# Code Setup\n",
27
+ "Run the following cells to set up the rest of the environment for prompting. In the following section, we set up the computational environment with imported code, set up your API key access to OpenAI, and load your language model. Note that the following cells may take a long time to run."
28
+ ]
29
+ },
30
+ {
31
+ "cell_type": "markdown",
32
+ "metadata": {},
33
+ "source": [
34
+ "## Library installation and loading\n",
35
+ "The following `pip install` code should be run if you're using Google Colab, or otherwise do not have a computational environment (e.g., _venv_, _conda virtual environment_, _Docker, Singularity, or other container_) with these packages installed."
36
+ ]
37
+ },
38
+ {
39
+ "cell_type": "raw",
40
+ "metadata": {},
41
+ "source": [
42
+ "---\n",
43
+ "skip_exec: true\n",
44
+ "---"
45
+ ]
46
+ },
47
+ {
48
+ "cell_type": "code",
49
+ "execution_count": null,
50
+ "metadata": {},
51
+ "outputs": [],
52
+ "source": [
53
+ "# run this code if you're using Google Colab or don't have these packages installed in your computing environment\n",
54
+ "# run this code if you're using Google Colab or don't have these packages installed in your computing environment\n",
+ "! pip install git+https://<token>@github.com/vanderbilt-data-science/lo-achievement.git"
55
+ ]
56
+ },
57
+ {
58
+ "cell_type": "code",
59
+ "execution_count": null,
60
+ "metadata": {},
61
+ "outputs": [],
62
+ "source": [
63
+ "# import required libraries\n",
64
+ "import numpy as np\n",
65
+ "import getpass\n",
66
+ "import os\n",
67
+ "from langchain.chat_models import ChatOpenAI\n",
68
+ "from langchain.chains import RetrievalQA\n",
69
+ "from langchain.schema import SystemMessage, HumanMessage, AIMessage"
70
+ ]
71
+ },
72
+ {
73
+ "cell_type": "code",
74
+ "execution_count": null,
75
+ "metadata": {},
76
+ "outputs": [],
77
+ "source": [
78
+ "# libraries from our package\n",
79
+ "from ai_classroom_suite.PromptInteractionBase import *\n",
80
+ "import ai_classroom_suite.SelfStudyPrompts as ssp"
81
+ ]
82
+ },
83
+ {
84
+ "cell_type": "markdown",
85
+ "metadata": {},
86
+ "source": [
87
+ "## API and model setup\n",
88
+ "\n",
89
+ "Use these cells to load the API key required for this notebook and create a basic OpenAI LLM model. The code below will prompt you to enter your API key and store it as an environment variable, which the model uses for authentication."
90
+ ]
91
+ },
92
+ {
93
+ "cell_type": "code",
94
+ "execution_count": null,
95
+ "metadata": {},
96
+ "outputs": [],
97
+ "source": [
98
+ "# Set up OpenAI API Key\n",
99
+ "set_openai_key()\n",
100
+ "\n",
101
+ "# Create model\n",
102
+ "mdl = create_model()"
103
+ ]
104
+ },
105
+ {
106
+ "cell_type": "markdown",
107
+ "metadata": {},
108
+ "source": [
109
+ "# Inspect Available Default Prompts\n",
110
+ "A number of default prompts have been provided for you so you don't need to form your own prompt to begin with. They will be listed below, and the different ways to interact with them are displayed."
111
+ ]
112
+ },
113
+ {
114
+ "cell_type": "code",
115
+ "execution_count": null,
116
+ "metadata": {},
117
+ "outputs": [
118
+ {
119
+ "data": {
120
+ "text/html": [
121
+ "<div>\n",
122
+ "<style scoped>\n",
123
+ " .dataframe tbody tr th:only-of-type {\n",
124
+ " vertical-align: middle;\n",
125
+ " }\n",
126
+ "\n",
127
+ " .dataframe tbody tr th {\n",
128
+ " vertical-align: top;\n",
129
+ " }\n",
130
+ "\n",
131
+ " .dataframe thead th {\n",
132
+ " text-align: right;\n",
133
+ " }\n",
134
+ "</style>\n",
135
+ "<table border=\"1\" class=\"dataframe\">\n",
136
+ " <thead>\n",
137
+ " <tr style=\"text-align: right;\">\n",
138
+ " <th></th>\n",
139
+ " <th>SELF_STUDY_DEFAULTS key</th>\n",
140
+ " <th>Prompt</th>\n",
141
+ " </tr>\n",
142
+ " </thead>\n",
143
+ " <tbody>\n",
144
+ " <tr>\n",
145
+ " <th>0</th>\n",
146
+ " <td>mc</td>\n",
147
+ " <td>Please design a 5 question multiple choice quiz about the provided text.</td>\n",
148
+ " </tr>\n",
149
+ " <tr>\n",
150
+ " <th>1</th>\n",
151
+ " <td>short_answer</td>\n",
152
+ " <td>Please design a 5 question short answer quiz about the provided text. The question types should be short answer. Expect the correct answers to be a few sentences long.</td>\n",
153
+ " </tr>\n",
154
+ " <tr>\n",
155
+ " <th>2</th>\n",
156
+ " <td>fill_blank</td>\n",
157
+ "      <td>Create a 5 question fill in the blank quiz referencing parts of the provided text.\\nThe \"blank\" part of the question should appear as \"________\". The answers should reflect what word(s) should go in the blank to make an accurate statement.\\nAn example is as follows: \"The author of the book is ______.\" The question should be a statement.\\n</td>\n",
158
+ " </tr>\n",
159
+ " <tr>\n",
160
+ " <th>3</th>\n",
161
+ " <td>sequencing</td>\n",
162
+ " <td>Create a 5 question questionnaire that will ask me to recall the steps or sequence of events\\nin the provided text.</td>\n",
163
+ " </tr>\n",
164
+ " <tr>\n",
165
+ " <th>4</th>\n",
166
+ " <td>relationships</td>\n",
167
+ "      <td>Create a 5 question quiz for the student that asks the student to identify relationships between topics or concepts that are important to understanding this text.</td>\n",
168
+ " </tr>\n",
169
+ " <tr>\n",
170
+ " <th>5</th>\n",
171
+ " <td>concepts</td>\n",
172
+ " <td>Design a 5 question quiz that asks me about definitions or concepts of importance in the provided text.</td>\n",
173
+ " </tr>\n",
174
+ " <tr>\n",
175
+ " <th>6</th>\n",
176
+ " <td>real_world_example</td>\n",
177
+ " <td>Demonstrate how the provided context can be applied to solve a real world problem.\\nAsk me questions about how the demonstration you provided relates to solving a real world problem.</td>\n",
178
+ " </tr>\n",
179
+ " <tr>\n",
180
+ " <th>7</th>\n",
181
+ " <td>randomized_questions</td>\n",
182
+ " <td>Generate a high-quality assessment consisting of 5 varied questions,\\neach of different types (open-ended, multiple choice, short answer, analogies, etc.)</td>\n",
183
+ " </tr>\n",
184
+ " </tbody>\n",
185
+ "</table>\n",
186
+ "</div>"
187
+ ],
188
+ "text/plain": [
189
+ " SELF_STUDY_DEFAULTS key \\\n",
190
+ "0 mc \n",
191
+ "1 short_answer \n",
192
+ "2 fill_blank \n",
193
+ "3 sequencing \n",
194
+ "4 relationships \n",
195
+ "5 concepts \n",
196
+ "6 real_world_example \n",
197
+ "7 randomized_questions \n",
198
+ "\n",
199
+ " Prompt \n",
200
+ "0 Please design a 5 question multiple choice quiz about the provided text. \n",
201
+ "1 Please design a 5 question short answer quiz about the provided text. The question types should be short answer. Expect the correct answers to be a few sentences long. \n",
202
+ "2     Create a 5 question fill in the blank quiz referencing parts of the provided text.\\nThe \"blank\" part of the question should appear as \"________\". The answers should reflect what word(s) should go in the blank to make an accurate statement.\\nAn example is as follows: \"The author of the book is ______.\" The question should be a statement.\\n  \n",
203
+ "3 Create a 5 question questionnaire that will ask me to recall the steps or sequence of events\\nin the provided text. \n",
204
+ "4                        Create a 5 question quiz for the student that asks the student to identify relationships between topics or concepts that are important to understanding this text.  \n",
205
+ "5 Design a 5 question quiz that asks me about definitions or concepts of importance in the provided text. \n",
206
+ "6 Demonstrate how the provided context can be applied to solve a real world problem.\\nAsk me questions about how the demonstration you provided relates to solving a real world problem. \n",
207
+ "7 Generate a high-quality assessment consisting of 5 varied questions,\\neach of different types (open-ended, multiple choice, short answer, analogies, etc.) "
208
+ ]
209
+ },
210
+ "metadata": {},
211
+ "output_type": "display_data"
212
+ }
213
+ ],
214
+ "source": [
215
+ "# show all prompts and names\n",
216
+ "ssp.print_all_self_study_prompts()"
217
+ ]
218
+ },
219
+ {
220
+ "cell_type": "code",
221
+ "execution_count": null,
222
+ "metadata": {},
223
+ "outputs": [
224
+ {
225
+ "data": {
226
+ "text/plain": [
227
+ "'Please design a 5 question multiple choice quiz about the provided text.'"
228
+ ]
229
+ },
230
+ "execution_count": null,
231
+ "metadata": {},
232
+ "output_type": "execute_result"
233
+ }
234
+ ],
235
+ "source": [
236
+ "# accessing texts of desired assessment types\n",
237
+ "ssp.SELF_STUDY_DEFAULTS['mc']"
238
+ ]
239
+ },
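+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you only want the valid lookup keys rather than the full table, the module also exposes accessors for them; a quick sketch follows."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# list the valid SELF_STUDY_DEFAULTS keys\n",
+ "ssp.list_all_self_study_prompt_keys()"
+ ]
+ },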
240
+ {
241
+ "cell_type": "markdown",
242
+ "metadata": {},
243
+ "source": [
244
+ "# Add your context\n",
245
+ "The context assigned here, a Robert Frost poem, serves as an example."
246
+ ]
247
+ },
248
+ {
249
+ "cell_type": "code",
250
+ "execution_count": null,
251
+ "metadata": {},
252
+ "outputs": [],
253
+ "source": [
254
+ "context = \"\"\" Two roads diverged in a yellow wood,\n",
255
+ "And sorry I could not travel both\n",
256
+ "And be one traveler, long I stood\n",
257
+ "And looked down one as far as I could\n",
258
+ "To where it bent in the undergrowth;\n",
259
+ "Then took the other, as just as fair,\n",
260
+ "And having perhaps the better claim,\n",
261
+ "Because it was grassy and wanted wear;\n",
262
+ "Though as for that the passing there\n",
263
+ "Had worn them really about the same,\n",
264
+ "And both that morning equally lay\n",
265
+ "In leaves no step had trodden black.\n",
266
+ "Oh, I kept the first for another day!\n",
267
+ "Yet knowing how way leads on to way,\n",
268
+ "I doubted if I should ever come back.\n",
269
+ "I shall be telling this with a sigh\n",
270
+ "Somewhere ages and ages hence:\n",
271
+ "Two roads diverged in a wood, and I—\n",
272
+ "I took the one less traveled by,\n",
273
+ "And that has made all the difference.\n",
274
+ "—-Robert Frost—-\n",
275
+ "Education Place: http://www.eduplace.com \"\"\""
276
+ ]
277
+ },
278
+ {
279
+ "cell_type": "markdown",
280
+ "metadata": {},
281
+ "source": [
282
+ "# A guide to prompting for self-study\n",
283
+ "In this section, we provide a number of different approaches for using AI to help you assess and explain the knowledge of your document. Start by interacting with the model and then try out the rest of the prompts!"
284
+ ]
285
+ },
286
+ {
287
+ "cell_type": "markdown",
288
+ "metadata": {},
289
+ "source": [
290
+ "## Interact with the model\n",
291
+ "\n",
292
+ "Now that your context is created, you can begin interacting with the model! Below, we have a comprehensive list of examples using different question types, but feel free to use this code block to experiment with the model.\n",
293
+ "\n",
294
+ "First, let's make the settings for the query. In other words, what are the learning objectives and what is the type of assessment we want to have?"
295
+ ]
296
+ },
297
+ {
298
+ "cell_type": "code",
299
+ "execution_count": null,
300
+ "metadata": {},
301
+ "outputs": [],
302
+ "source": [
303
+ "# set short answer as the assessment type\n",
304
+ "assessment_type = ssp.SELF_STUDY_DEFAULTS[\"short_answer\"]\n",
305
+ "\n",
306
+ "# set learning objectives if desired\n",
307
+ "learning_objs = (\"\"\"1. Identify the key elements of the poem: narrator, setting, and underlying message.\n",
308
+ " 2. Understand the literary devices used in poetry and their purposes.\"\"\")"
309
+ ]
310
+ },
311
+ {
312
+ "cell_type": "markdown",
313
+ "metadata": {},
314
+ "source": [
315
+ "Next, let's use the predefined defaults with the model and provided APIs from `ai_classroom_suite`."
316
+ ]
317
+ },
318
+ {
319
+ "cell_type": "code",
320
+ "execution_count": null,
321
+ "metadata": {},
322
+ "outputs": [
323
+ {
324
+ "data": {
325
+ "text/plain": [
326
+ "'Question 1: Who is the narrator of the poem and what is the setting?\\n\\nPlease provide your answer in a few sentences.'"
327
+ ]
328
+ },
329
+ "execution_count": null,
330
+ "metadata": {},
331
+ "output_type": "execute_result"
332
+ }
333
+ ],
334
+ "source": [
335
+ "# Ask the tutor to prompt you based on the text\n",
336
+ "get_tutoring_answer(context, mdl, assessment_request = assessment_type,\n",
337
+ " learning_objectives = learning_objs)"
338
+ ]
339
+ },
340
+ {
341
+ "cell_type": "markdown",
342
+ "metadata": {},
343
+ "source": [
344
+ "The complete prompt sent to the model is assembled from pre-generated templates: your context, assessment request, and learning objectives are substituted in to create the full prompt, as shown below."
345
+ ]
346
+ },
347
+ {
348
+ "cell_type": "code",
349
+ "execution_count": null,
350
+ "metadata": {},
351
+ "outputs": [
352
+ {
353
+ "name": "stdout",
354
+ "output_type": "stream",
355
+ "text": [
356
+ "System: You are a world-class tutor helping students to perform better on oral and written exams through interactive experiences. When assessing and evaluating students, you always ask one question at a time, and wait for the student's response before providing them with feedback. Asking one question at a time, waiting for the student's response, and then commenting on the strengths and weaknesses of their responses (when appropriate) is what makes you such a sought-after, world-class tutor.\n",
357
+ "Human: I'm trying to better understand the text provided below. Please design a 5 question short answer quiz about the provided text. The question types should be short answer. Expect the correct answers to be a few sentences long. The learning objectives to be assessed are: 1. Identify the key elements of the poem: narrator, setting, and underlying message.\n",
358
+ "    2. Understand the literary devices used in poetry and their purposes.. Although I may request more than one assessment question, you should only provide ONE question in your initial response. Do not include the answer in your response. If I get an answer wrong, provide me with an explanation of why it was incorrect, and then give me additional chances to respond until I get the correct choice. Explain why the correct choice is right. The text that you will base your questions on is as follows: Two roads diverged in a yellow wood,\n",
359
+ "And sorry I could not travel both\n",
360
+ "And be one traveler, long I stood\n",
361
+ "And looked down one as far as I could\n",
362
+ "To where it bent in the undergrowth;\n",
363
+ "Then took the other, as just as fair,\n",
364
+ "And having perhaps the better claim,\n",
365
+ "Because it was grassy and wanted wear;\n",
366
+ "Though as for that the passing there\n",
367
+ "Had worn them really about the same,\n",
368
+ "And both that morning equally lay\n",
369
+ "In leaves no step had trodden black.\n",
370
+ "Oh, I kept the first for another day!\n",
371
+ "Yet knowing how way leads on to way,\n",
372
+ "I doubted if I should ever come back.\n",
373
+ "I shall be telling this with a sigh\n",
374
+ "Somewhere ages and ages hence:\n",
375
+ "Two roads diverged in a wood, and I—\n",
376
+ "I took the one less traveled by,\n",
377
+ "And that has made all the difference.\n",
378
+ "—-Robert Frost—-\n",
379
+ "Education Place: http://www.eduplace.com .\n"
380
+ ]
381
+ }
382
+ ],
383
+ "source": [
384
+ "# Use different function to create the prompt\n",
385
+ "full_prompt = get_tutoring_prompt(context, assessment_request = assessment_type,\n",
386
+ " learning_objectives = learning_objs)\n",
387
+ "\n",
388
+ "# Show the prompt as a string\n",
389
+ "print(full_prompt.to_string())"
390
+ ]
391
+ },
392
+ {
393
+ "cell_type": "markdown",
394
+ "metadata": {},
395
+ "source": [
396
+ "Alternatively, you can define your own prompt, which you can adapt as appropriate. To modify the kind of assessment you'll be asking for, change `assessment_request`. An example of how to add more context for the model is shown below as well."
397
+ ]
398
+ },
399
+ {
400
+ "cell_type": "code",
401
+ "execution_count": null,
402
+ "metadata": {},
403
+ "outputs": [
404
+ {
405
+ "data": {
406
+ "text/plain": [
407
+ "\"Question: Who is the narrator in the poem and what is the underlying message conveyed?\\n\\nHint: Pay attention to the pronouns used throughout the poem to determine the narrator's identity. Additionally, think about the choices made by the narrator and the impact those choices have on their life. The underlying message is related to the consequences of these choices.\\n\\nTake your time to reflect on the text and provide your answer when you're ready.\""
408
+ ]
409
+ },
410
+ "execution_count": null,
411
+ "metadata": {},
412
+ "output_type": "execute_result"
413
+ }
414
+ ],
415
+ "source": [
416
+ "# Use your own texts\n",
417
+ "custom_request = (\"Ask me a short answer question about the provided text. The questions you ask should allow\"\n",
418
+ " \" me to demonstrate my creativity, capacity for out-of-the-box thinking, insights, and deeper meaning \"\n",
419
+ " \" of the text.\")\n",
420
+ "additional_context = (\"This is a text written by Robert Frost, a famous American poet. The text is widely studied in K-12 literature\"\n",
421
+ " \" education courses, and should be read with an eye towards the philosophical and human themes of the text.\")\n",
422
+ "\n",
423
+ "# Concatenate context\n",
424
+ "informed_context = context + \"\\n Additional information about the text is: \" + additional_context \n",
425
+ "\n",
426
+ "# Use custom_request defined as the assessment request\n",
427
+ "get_tutoring_answer(informed_context, mdl, assessment_request = custom_request,\n",
428
+ " learning_objectives = learning_objs)"
429
+ ]
430
+ },
431
+ {
432
+ "cell_type": "markdown",
433
+ "metadata": {},
434
+ "source": [
435
+ "## Types of Questions and Prompts\n",
436
+ "\n",
437
+ "Below is a comprehensive list of question types and prompt templates designed by our team. There are also example code blocks, where you can see how the model performed with the example and try it for yourself using the prompt template."
438
+ ]
439
+ },
440
+ {
441
+ "cell_type": "markdown",
442
+ "metadata": {},
443
+ "source": [
444
+ "### Multiple Choice\n",
445
+ "\n",
446
+ "Prompt: The following text should be used as the basis for the instructions which follow: {context}. Please design a {number of questions} question quiz about {name or reference to context} which reflects the learning objectives: {list of learning objectives}. The questions should be multiple choice. Provide one question at a time, and wait for my response before providing me with feedback. Again, while the quiz may ask for multiple questions, you should only provide ONE question in you initial response. Do not include the answer in your response. If I get an answer wrong, provide me with an explanation of why it was incorrect,and then give me additional chances to respond until I get the correct choice. Explain why the correct choice is right."
447
+ ]
448
+ },
449
+ {
450
+ "cell_type": "code",
451
+ "execution_count": null,
452
+ "metadata": {},
453
+ "outputs": [
454
+ {
455
+ "name": "stdout",
456
+ "output_type": "stream",
457
+ "text": [
458
+ "Question 1: Who is the narrator of the poem?\n",
459
+ "\n",
460
+ "A) Robert Frost\n",
461
+ "B) The traveler \n",
462
+ "C) The undergrowth \n",
463
+ "D) The wood\n",
464
+ "\n",
465
+ "Please provide your answer.\n"
466
+ ]
467
+ }
468
+ ],
469
+ "source": [
470
+ "# Multiple choice code example\n",
471
+ "tutor_q = get_tutoring_answer(context, mdl, assessment_request = ssp.SELF_STUDY_DEFAULTS['mc'],\n",
472
+ " learning_objectives = learning_objs)\n",
473
+ "print(tutor_q)"
474
+ ]
475
+ },
476
+ {
477
+ "cell_type": "markdown",
478
+ "metadata": {},
479
+ "source": [
480
+ "### Short Answer\n",
481
+ "\n",
482
+ "Prompt: Please design a {number of questions} question quiz about {context} which reflects the learning objectives: {list of learning objectives}. The questions should be short answer. Expect the correct answers to be {anticipated length} long. Provide one question at a time, and wait for my response before providing me with feedback. Again, while the quiz may ask for multiple questions, you should only provide ONE question in you initial response. Do not include the answer in your response. If I get an answer wrong, provide me with an explanation of why it was incorrect,and then give me additional chances to respond until I get the correct choice. Explain why the correct choice is right."
483
+ ]
484
+ },
485
+ {
486
+ "cell_type": "code",
487
+ "execution_count": null,
488
+ "metadata": {},
489
+ "outputs": [
490
+ {
491
+ "name": "stdout",
492
+ "output_type": "stream",
493
+ "text": [
494
+ "Question 1: Who is the narrator of the poem and what is the setting?\n",
495
+ "\n",
496
+ "Remember to answer the question by identifying the narrator of the poem and describing the setting in which the events take place.\n"
497
+ ]
498
+ }
499
+ ],
500
+ "source": [
501
+ "# Short answer code example\n",
502
+ "tutor_q = get_tutoring_answer(context, mdl, assessment_request = ssp.SELF_STUDY_DEFAULTS['short_answer'],\n",
503
+ " learning_objectives = learning_objs)\n",
504
+ "print(tutor_q)"
505
+ ]
506
+ },
507
+ {
508
+ "cell_type": "markdown",
509
+ "metadata": {},
510
+ "source": [
511
+ "### Fill-in-the-blank\n",
512
+ "\n",
513
+ "Prompt: Create a {number of questions} question fill in the blank quiz refrencing {context}. The quiz should reflect the learning objectives: {learning objectives}. The \"blank\" part of the question should appear as \"________\". The answers should reflect what word(s) should go in the blank an accurate statement.\n",
514
+ "\n",
515
+ "An example is the follow: \"The author of the book is \"________.\"\n",
516
+ "\n",
517
+ "The question should be a statement. Provide one question at a time, and wait for my response before providing me with feedback. Again, while the quiz may ask for multiple questions, you should only provide ONE question in you initial response. Do not include the answer in your response. If I get an answer wrong, provide me with an explanation of why it was incorrect,and then give me additional chances to respond until I get the correct choice. Explain why the correct choice is right."
518
+ ]
519
+ },
520
+ {
521
+ "cell_type": "code",
522
+ "execution_count": null,
523
+ "metadata": {},
524
+ "outputs": [
525
+ {
526
+ "name": "stdout",
527
+ "output_type": "stream",
528
+ "text": [
529
+ "Question 1: The poem \"The Road Not Taken\" was written by ________.\n",
530
+ "\n",
531
+ "Question 2: What is the color of the wood where the two roads diverged? \n",
532
+ "\n",
533
+ "Question 3: What is the reason the narrator gives for choosing the second road?\n",
534
+ "\n",
535
+ "Question 4: What does the narrator say about the wear of both roads?\n",
536
+ "\n",
537
+ "Question 5: According to the poem, what has made all the difference in the narrator's life?\n",
538
+ "\n",
539
+ "Remember to wait for the student's response before providing feedback.\n"
540
+ ]
541
+ }
542
+ ],
543
+ "source": [
544
+ "# Fill in the blank code example\n",
545
+ "tutor_q = get_tutoring_answer(context, mdl, assessment_request = ssp.SELF_STUDY_DEFAULTS['fill_blank'],\n",
546
+ " learning_objectives = learning_objs)\n",
547
+ "print(tutor_q)"
548
+ ]
549
+ },
550
+ {
551
+ "cell_type": "markdown",
552
+ "metadata": {},
553
+ "source": [
554
+ "### Sequencing\n",
555
+ "\n",
556
+ "Prompt: Please develop a {number of questions} question questionnaire that will ask me to recall the steps involved in the following learning objectives in regard to {context}: {learning objectives}. Provide one question at a time, and wait for my response before providing me with feedback. Again, while the quiz may ask for multiple questions, you should only provide ONE question in you initial response. Do not include the answer in your response. If I get an answer wrong, provide me with an explanation of why it was incorrect, and then give me additional chances to respond until I get the correct choice. After I respond, explain their sequence to me."
557
+ ]
558
+ },
559
+ {
560
+ "cell_type": "code",
561
+ "execution_count": null,
562
+ "metadata": {},
563
+ "outputs": [
564
+ {
565
+ "name": "stdout",
566
+ "output_type": "stream",
567
+ "text": [
568
+ "Question 1: Who is the narrator of the poem?\n",
569
+ "\n",
570
+ "Question 2: What is the setting of the poem?\n",
571
+ "\n",
572
+ "Question 3: What is the underlying message of the poem?\n",
573
+ "\n",
574
+ "Question 4: What literary device is used in the line \"Two roads diverged in a yellow wood\"?\n",
575
+ "\n",
576
+ "Question 5: What is the purpose of using the literary device in question 4?\n",
577
+ "\n",
578
+ "Please answer question 1 first.\n"
579
+ ]
580
+ }
581
+ ],
582
+ "source": [
583
+ "# Sequence example\n",
584
+ "tutor_q = get_tutoring_answer(context, mdl, assessment_request = ssp.SELF_STUDY_DEFAULTS['sequencing'],\n",
585
+ " learning_objectives = learning_objs)\n",
586
+ "\n",
587
+ "print(tutor_q)"
588
+ ]
589
+ },
590
+ {
591
+ "cell_type": "markdown",
592
+ "metadata": {},
593
+ "source": [
594
+ "### Relationships/drawing connections\n",
595
+ "\n",
596
+ "Prompt: Please design a {number of questions} question quiz that asks me to explain the relationships that exist within the following learning objectives, referencing {context}: {learning objectives}. Provide one question at a time, and wait for my response before providing me with feedback. Again, while the quiz may ask for multiple questions, you should only provide ONE question in you initial response. Do not include the answer in your response. If I get an answer wrong, provide me with an explanation of why it was incorrect,and then give me additional chances to respond until I get the correct choice. Explain why the correct choice is right."
597
+ ]
598
+ },
599
+ {
600
+ "cell_type": "code",
601
+ "execution_count": null,
602
+ "metadata": {},
603
+ "outputs": [
604
+ {
605
+ "name": "stdout",
606
+ "output_type": "stream",
607
+ "text": [
608
+ "Question 1: Who is the narrator of the poem?\n",
609
+ "\n",
610
+ "Question 2: What is the setting of the poem?\n",
611
+ "\n",
612
+ "Question 3: What is the underlying message of the poem?\n",
613
+ "\n",
614
+ "Question 4: What literary device is used when the narrator says, \"Two roads diverged in a yellow wood\"?\n",
615
+ "\n",
616
+ "Question 5: What literary device is used when the narrator says, \"I took the one less traveled by, And that has made all the difference\"?\n"
617
+ ]
618
+ }
619
+ ],
620
+ "source": [
621
+ "# Relationships example\n",
622
+ "tutor_q = get_tutoring_answer(context, mdl, assessment_request = ssp.SELF_STUDY_DEFAULTS['relationships'],\n",
623
+ " learning_objectives = learning_objs)\n",
624
+ "\n",
625
+ "print(tutor_q)"
626
+ ]
627
+ },
628
+ {
629
+ "cell_type": "markdown",
630
+ "metadata": {},
631
+ "source": [
632
+ "### Concepts and Definitions\n",
633
+ "\n",
634
+ "Prompt: Design a {number of questions} question quiz that asks me about definitions related to the following learning objectives: {learning objectives} - based on {context}\".\n",
635
+ "Provide one question at a time, and wait for my response before providing me with feedback. Again, while the quiz may ask for multiple questions, you should only provide ONE question in you initial response. Do not include the answer in your response. If I get an answer wrong, provide me with an explanation of why it was incorrect,and then give me additional chances to respond until I get the correct choice. Explain why the correct choice is right.\n"
636
+ ]
637
+ },
638
+ {
639
+ "cell_type": "code",
640
+ "execution_count": null,
641
+ "metadata": {},
642
+ "outputs": [
643
+ {
644
+ "name": "stdout",
645
+ "output_type": "stream",
646
+ "text": [
647
+ "Question 1: Who is the narrator of the poem? \n",
648
+ "\n",
649
+ "Remember, the narrator is the person who is speaking or telling the story.\n"
650
+ ]
651
+ }
652
+ ],
653
+ "source": [
654
+ "# Concepts and definitions example\n",
655
+ "tutor_q = get_tutoring_answer(context, mdl, assessment_request = ssp.SELF_STUDY_DEFAULTS['concepts'],\n",
656
+ " learning_objectives = learning_objs)\n",
657
+ "\n",
658
+ "print(tutor_q)"
659
+ ]
660
+ },
661
+ {
662
+ "cell_type": "markdown",
663
+ "metadata": {},
664
+ "source": [
665
+ "### Real Word Examples\n",
666
+ "\n",
667
+ "Prompt: Demonstrate how {context} can be applied to solve a real-world problem related to the following learning objectives: {learning objectives}. Ask me questions regarding this theory/concept.\n",
668
+ "\n",
669
+ "Provide one question at a time, and wait for my response before providing me with feedback. Again, while the quiz may ask for multiple questions, you should only provide ONE question in you initial response. Do not include the answer in your response. If I get an answer wrong, provide me with an explanation of why it was incorrect,and then give me additional chances to respond until I get the correct choice. Explain why the correct choice is right."
670
+ ]
671
+ },
672
+ {
673
+ "cell_type": "code",
674
+ "execution_count": null,
675
+ "metadata": {},
676
+ "outputs": [
677
+ {
678
+ "name": "stdout",
679
+ "output_type": "stream",
680
+ "text": [
681
+ "Question 1: Who is the narrator of the poem?\n",
682
+ "\n",
683
+ "Question 2: What is the setting of the poem?\n",
684
+ "\n",
685
+ "Question 3: What is the underlying message of the poem?\n",
686
+ "\n",
687
+ "Remember to provide your answer to one question at a time.\n"
688
+ ]
689
+ }
690
+ ],
691
+ "source": [
692
+ "# Real word example\n",
693
+ "tutor_q = get_tutoring_answer(context, mdl, assessment_request = ssp.SELF_STUDY_DEFAULTS['real_world_example'],\n",
694
+ " learning_objectives = learning_objs)\n",
695
+ "\n",
696
+ "print(tutor_q)"
697
+ ]
698
+ },
699
+ {
700
+ "cell_type": "markdown",
701
+ "metadata": {},
702
+ "source": [
703
+ "### Randomized Question Types\n",
704
+ "\n",
705
+ "Prompt: Please generate a high-quality assessment consisting of {number of questions} varying questions, each of different types (open-ended, multiple choice, etc.), to determine if I achieved the following learning objectives in regards to {context}: {learning objectives}.\n",
706
+ "\n",
707
+ "Provide one question at a time, and wait for my response before providing me with feedback. Again, while the quiz may ask for multiple questions, you should only provide ONE question in you initial response. Do not include the answer in your response. If I get an answer wrong, provide me with an explanation of why it was incorrect,and then give me additional chances to respond until I get the correct choice. Explain why the correct choice is right."
708
+ ]
709
+ },
710
+ {
711
+ "cell_type": "code",
712
+ "execution_count": null,
713
+ "metadata": {},
714
+ "outputs": [
715
+ {
716
+ "name": "stdout",
717
+ "output_type": "stream",
718
+ "text": [
719
+ "Question 1 (Open-ended): Who is the narrator of the poem and what is the setting?\n",
720
+ "\n",
721
+ "Question 2 (Multiple choice): Which literary device is used in the line \"And sorry I could not travel both\"?\n",
722
+ "\n",
723
+ "a) Simile\n",
724
+ "b) Metaphor\n",
725
+ "c) Alliteration\n",
726
+ "d) Personification\n",
727
+ "\n",
728
+ "Question 3 (Short answer): Describe the underlying message of the poem in one sentence.\n",
729
+ "\n",
730
+ "Question 4 (Analogies): Complete the analogy: \"The two roads diverged in a yellow wood\" is to the physical setting as \"I took the one less traveled by\" is to ___________.\n",
731
+ "\n",
732
+ "Question 5 (Open-ended): Identify and explain one additional literary device used in the poem and its purpose.\n",
733
+ "\n",
734
+ "Please choose one question from above for me to provide a detailed evaluation.\n"
735
+ ]
736
+ }
737
+ ],
738
+ "source": [
739
+ "# Randomized question types\n",
740
+ "tutor_q = get_tutoring_answer(context, mdl, assessment_request = ssp.SELF_STUDY_DEFAULTS['randomized_questions'],\n",
741
+ " learning_objectives = learning_objs)\n",
742
+ "\n",
743
+ "print(tutor_q)"
744
+ ]
745
+ },
746
+ {
747
+ "cell_type": "markdown",
748
+ "metadata": {},
749
+ "source": [
750
+ "### Quantiative evaluation the correctness of a student's answer\n",
751
+ "\n",
752
+ "Prompt: (A continuation of the previous chat) Please generate the main points of the student’s answer to the previous question, and evaluate on a scale of 1 to 5 how comprehensive the student’s answer was in relation to the learning objectives, and explain why he or she received this rating, including what was missed in his or her answer if the student’s answer wasn’t complete.\n"
753
+ ]
754
+ },
755
+ {
756
+ "cell_type": "code",
757
+ "execution_count": null,
758
+ "metadata": {},
759
+ "outputs": [
760
+ {
761
+ "name": "stdout",
762
+ "output_type": "stream",
763
+ "text": [
764
+ "Based on the provided text, the student's answer to the previous question was not provided. Therefore, I cannot generate the main points of the student's answer or evaluate its comprehensiveness in relation to the learning objectives. Please provide the student's answer to the previous question so that I can assist you further.\n"
765
+ ]
766
+ }
767
+ ],
768
+ "source": [
769
+ "# qualitative evaluation\n",
770
+ "qualitative_query = \"\"\" Please generate the main points of the student’s answer to the previous question,\n",
771
+ " and evaluate on a scale of 1 to 5 how comprehensive the student’s answer was in relation to the learning objectives,\n",
772
+ " and explain why he or she received this rating, including what was missed in his or her answer if the student’s answer wasn’t complete.\"\"\"\n",
773
+ "\n",
774
+ "# Note that this uses the previous result and query in the context\n",
775
+ "last_answer = ''\n",
776
+ "\n",
777
+ "# Get result with formatting to emphasize changes in parameter inputs\n",
778
+ "result = get_tutoring_answer(last_answer + context,\n",
779
+ " mdl,\n",
780
+ " assessment_request = qualitative_query,\n",
781
+ " learning_objectives = learning_objs)\n",
782
+ "\n",
783
+ "print(result)"
784
+ ]
785
+ }
786
+ ],
787
+ "metadata": {
788
+ "kernelspec": {
789
+ "display_name": "python3",
790
+ "language": "python",
791
+ "name": "python3"
792
+ }
793
+ },
794
+ "nbformat": 4,
795
+ "nbformat_minor": 0
796
+ }
prompt_with_vector_store.ipynb ADDED
@@ -0,0 +1,637 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "<a href=\"https://colab.research.google.com/github/vanderbilt-data-science/lo-achievement/blob/main/prompt_with_vector_store.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "markdown",
12
+ "metadata": {},
13
+ "source": [
14
+ "# LLMs for Self-Study\n",
15
+ "> A prompt and code template for better understanding texts\n",
16
+ "\n",
17
+ "This notebook provides a guide for using LLMs for self-study programmatically. A number of prompt templates are provided to assist with generating great assessments for self-study, and code is additionally provided for fast usage. This notebook is best leveraged for a set of documents (text or PDF preferred) **to be uploaded** for interaction with the model.\n",
18
+ "\n",
19
+ "This version of the notebook is best suited for those who prefer to use files from their local drive as context rather than copy and pasting directly into the notebook to be used as context for the model. If you prefer to copy and paste text, you should direct yourself to the [prompt_with_context](https://colab.research.google.com/github/vanderbilt-data-science/lo-achievement/blob/main/prompt_with_context.ipynb) notebook."
20
+ ]
21
+ },
22
+ {
23
+ "cell_type": "raw",
24
+ "metadata": {},
25
+ "source": [
26
+ "---\n",
27
+ "skip_exec: true\n",
28
+ "---"
29
+ ]
30
+ },
31
+ {
32
+ "cell_type": "code",
33
+ "execution_count": null,
34
+ "metadata": {},
35
+ "outputs": [],
36
+ "source": [
37
+ "# run this code if you're using Google Colab or don't have these packages installed in your computing environment\n",
38
+ "! pip install pip install git+https://<token>@github.com/vanderbilt-data-science/lo-achievement.git"
39
+ ]
40
+ },
41
+ {
42
+ "cell_type": "code",
43
+ "execution_count": null,
44
+ "metadata": {},
45
+ "outputs": [],
46
+ "source": [
47
+ "#libraries for user setup code\n",
48
+ "from getpass import getpass\n",
49
+ "from logging import raiseExceptions\n",
50
+ "\n",
51
+ "#self import code\n",
52
+ "from ai_classroom_suite.PromptInteractionBase import *\n",
53
+ "from ai_classroom_suite.IOHelperUtilities import *\n",
54
+ "from ai_classroom_suite.SelfStudyPrompts import *\n",
55
+ "from ai_classroom_suite.MediaVectorStores import *"
56
+ ]
57
+ },
58
+ {
59
+ "cell_type": "markdown",
60
+ "metadata": {},
61
+ "source": [
62
+ "# User Settings\n",
63
+ "In this section, you'll set your OpenAI API Key (for use with the OpenAI model), configure your environment/files for upload, and upload those files."
64
+ ]
65
+ },
66
+ {
67
+ "cell_type": "code",
68
+ "execution_count": null,
69
+ "metadata": {},
70
+ "outputs": [],
71
+ "source": [
72
+ "# Run this cell and enter your OpenAI API key when prompted\n",
73
+ "set_openai_key()"
74
+ ]
75
+ },
76
+ {
77
+ "cell_type": "code",
78
+ "execution_count": null,
79
+ "metadata": {},
80
+ "outputs": [],
81
+ "source": [
82
+ "# Create model\n",
83
+ "mdl_name = 'gpt-3.5-turbo-16k'\n",
84
+ "chat_llm = create_model(mdl_name)"
85
+ ]
86
+ },
87
+ {
88
+ "cell_type": "markdown",
89
+ "metadata": {},
90
+ "source": [
91
+ "## Define Your Documents Source\n",
92
+ "You may upload your files directly from your computer, or you may choose to do so via your Google Drive. Below, you will find instructions for both methods.\n",
93
+ "\n",
94
+ "For either model, begin by setting the `upload_setting` variable to:\n",
95
+ "* `'Local Drive'` - if you have files that are on your own computer (locally), or\n",
96
+ "* `'Google Drive'` - if you have files that are stored on Google Drive\n",
97
+ "\n",
98
+ "e.g.,\n",
99
+ "`upload_setting='Google Drive'`.\n",
100
+ "Don't forget the quotes around your selection!"
101
+ ]
102
+ },
103
+ {
104
+ "cell_type": "code",
105
+ "execution_count": null,
106
+ "metadata": {},
107
+ "outputs": [],
108
+ "source": [
109
+ "## Settings for upload: via local drive or Google Drive\n",
110
+ "### Please input either \"Google Drive\" or \"Local Drive\" into the empty string\n",
111
+ "\n",
112
+ "#upload_setting = 'Google Drive'\n",
113
+ "upload_setting = 'Local Drive'"
114
+ ]
115
+ },
116
+ {
117
+ "cell_type": "markdown",
118
+ "metadata": {},
119
+ "source": [
120
+ "<p style='color:green'><strong>Before Continuing</strong> - Make sure you have input your choice of upload into the `upload_setting`` variable above (Options: \"Local Drive\" or \"Google Drive\") as described in the above instructions.</p>"
121
+ ]
122
+ },
123
+ {
124
+ "cell_type": "markdown",
125
+ "metadata": {},
126
+ "source": [
127
+ "## Upload your Files\n",
128
+ "Now, you'll upload your files. When you run the below code cell, you'll be able to follow the instructions for local or Google Drive upload described here. If you would like to use our example document (Robert Frost's \"The Road Not Taken\", you can download the file from [this link](https://drive.google.com/drive/folders/1wpEoGACUqyNRYa4zBZeNkqcLJrGQbA53?usp=sharing) and upload via the instructions above.\n",
129
+ "\n",
130
+ "**If you selected **\"Local Drive\"** :**\n",
131
+ "> If you selected Local Drive, you'll need to start by selecting your local files. Run the code cell below. Once the icon appears, click the \"Choose File\". This will direct you to your computer's local drive. Select the file you would like to upload as context. The files will appear in the right sidebar. Then follow the rest of the steps in the \"Uploading Your files (Local Drive and Google Drive)\" below.\n",
132
+ "\n",
133
+ "**If you selected **\"Google Drive\"**: **\n",
134
+ "> If you selected Google Drive, you'll need to start by allowing access to your Google Drive. Run the code cell below. You will be redirected to a window where you will allow access to your Google Drive by logging into your Google Account. Your Drive will appear as a folder in the left side panel. Navigate through your Google Drive until you've found the file that you'd like to upload.\n",
135
+ "\n",
136
+ "Your files are now accessible to the code."
137
+ ]
138
+ },
139
+ {
140
+ "cell_type": "code",
141
+ "execution_count": null,
142
+ "metadata": {},
143
+ "outputs": [
144
+ {
145
+ "data": {
146
+ "application/vnd.jupyter.widget-view+json": {
147
+ "model_id": "e10a33b291a14f8089a1dea89f872998",
148
+ "version_major": 2,
149
+ "version_minor": 0
150
+ },
151
+ "text/plain": [
152
+ "FileChooser(path='/workspaces/lo-achievement', filename='', title='Use the following file chooser to add each …"
153
+ ]
154
+ },
155
+ "metadata": {},
156
+ "output_type": "display_data"
157
+ },
158
+ {
159
+ "data": {
160
+ "application/vnd.jupyter.widget-view+json": {
161
+ "model_id": "4649cff2ba0942aa9c5e073be85f40fb",
162
+ "version_major": 2,
163
+ "version_minor": 0
164
+ },
165
+ "text/plain": [
166
+ "Output()"
167
+ ]
168
+ },
169
+ "metadata": {},
170
+ "output_type": "display_data"
171
+ }
172
+ ],
173
+ "source": [
174
+ "# Run this cell then following the instructions to upload your file\n",
175
+ "selected_files = setup_drives(upload_setting)"
176
+ ]
177
+ },
178
+ {
179
+ "cell_type": "code",
180
+ "execution_count": null,
181
+ "metadata": {},
182
+ "outputs": [
183
+ {
184
+ "data": {
185
+ "text/plain": [
186
+ "['/workspaces/lo-achievement/roadnottaken.txt']"
187
+ ]
188
+ },
189
+ "execution_count": null,
190
+ "metadata": {},
191
+ "output_type": "execute_result"
192
+ }
193
+ ],
194
+ "source": [
195
+ "selected_files"
196
+ ]
197
+ },
198
+ {
199
+ "cell_type": "markdown",
200
+ "metadata": {},
201
+ "source": [
202
+ "# Resource and Personal Tutor Creation\n",
203
+ "Congratulations! You've nearly finished with the setup! From here, you can now run this section of cells using the arrow to the left to set up your vector store and create your model."
204
+ ]
205
+ },
206
+ {
207
+ "cell_type": "markdown",
208
+ "metadata": {},
209
+ "source": [
210
+ "## Create a vector store with your document\n",
211
+ "\n",
212
+ "With the file path, you can now create a vector store using the document that you uploaded. We expose this creation in case you want to modify the kind of vector store that you're creating. Run the cell below to create the default provided vector store."
213
+ ]
214
+ },
215
+ {
216
+ "cell_type": "code",
217
+ "execution_count": null,
218
+ "metadata": {},
219
+ "outputs": [],
220
+ "source": [
221
+ "# Create vector store\n",
222
+ "doc_segments = get_document_segments(selected_files, data_type = 'files')\n",
223
+ "chroma_db, vs_retriever = create_local_vector_store(doc_segments, search_kwargs={\"k\": 1})"
224
+ ]
225
+ },
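+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The cell above retrieves only the single most relevant segment per query (`k=1`). As a minimal sketch, and assuming `create_local_vector_store` simply forwards `search_kwargs` to the retriever it returns, you could widen retrieval to several segments:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Optional variation (assumes search_kwargs is forwarded to the retriever):\n",
+ "# return the 3 most relevant segments per query instead of 1\n",
+ "chroma_db, vs_retriever = create_local_vector_store(doc_segments, search_kwargs={\"k\": 3})"
+ ]
+ },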
226
+ {
227
+ "cell_type": "markdown",
228
+ "metadata": {},
229
+ "source": [
230
+ "## Create the model which will do the vector store lookup and tutoring"
231
+ ]
232
+ },
233
+ {
234
+ "cell_type": "code",
235
+ "execution_count": null,
236
+ "metadata": {},
237
+ "outputs": [],
238
+ "source": [
239
+ "# Create retrieval chain\n",
240
+ "qa_chain = create_tutor_mdl_chain(kind=\"retrieval_qa\", retriever = vs_retriever)"
241
+ ]
242
+ },
243
+ {
244
+ "cell_type": "markdown",
245
+ "metadata": {},
246
+ "source": [
247
+ "# A guide to prompting for self-study\n",
248
+ "In this section, we provide a number of different approaches for using AI to help you assess and explain the knowledge of your document. Start by interacting with the model and then try out the rest of the prompts!"
249
+ ]
250
+ },
251
+ {
252
+ "cell_type": "markdown",
253
+ "metadata": {},
254
+ "source": [
255
+ "## Brief overview of tutoring code options\n",
256
+ "\n",
257
+ "Now that your vector store is created, you can begin interacting with the model! You will interact with the model with a vector store using the `get_tutoring_answer` function below, and details are provided regarding the functionality below.\n",
258
+ "\n",
259
+ "Consider the multiple choice code snippet:\n",
260
+ "```{python}\n",
261
+ "tutor_q = get_tutoring_answer(context = '',\n",
262
+ " qa_chain,\n",
263
+ " assessment_request = SELF_STUDY_DEFAULTS['mc'],\n",
264
+ " learning_objectives = learning_objs,\n",
265
+ " input_kwargs = {'question':topic})\n",
266
+ "```\n",
267
+ "\n",
268
+ "This is how we're able to interact with the model for tutoring when using vector stores. The parameters are as follows:\n",
269
+ "\n",
270
+ "* `context` will be an empty string or you can also set it to `None`. This is because this field is automatically populated using the vector store retreiver.\n",
271
+ "* `qa_chain` is the model that you're using - we created this model chain a few cells above. \n",
272
+ "* `assessment_request` is your way of telling the model what kind of assessment you want. In the example above, we use some defaults provided for multiple choice. You can also insert your own text here. To learn more about these defaults, see the `prompt_with_context.ipynb` in the CLAS repo.\n",
273
+ "* `learning_objectives` are the learning objectives that you want to assess in a single paragraph string. You can set this to '' if you don't want to define any learning objectives. If you don't provide one, the model will use the default learning objectives.\n",
274
+ "* `input_kwargs` are additional inputs that we can define in the prompts. Above, you see that the keyword `question` is defined. `question` is the text used to retrieve relevant texts from the vector store. Above, we define a custom topic. If you were to omit this parameter, the model would use `assessment_request` as the text to retrieve relevant documents from the vector store. See the examples below for both scenarios.\n",
275
+ "\n"
276
+ ]
277
+ },
278
+ {
279
+ "cell_type": "markdown",
280
+ "metadata": {},
281
+ "source": [
282
+ "## Sample topics and learning objectives\n",
283
+ "\n",
284
+ "Below, we define a topic (used to retrieve documents from the vector store if provided) and learning objectives which will be used in the following examples. You can change these as needed for your purpose."
285
+ ]
286
+ },
287
+ {
288
+ "cell_type": "code",
289
+ "execution_count": null,
290
+ "metadata": {},
291
+ "outputs": [],
292
+ "source": [
293
+ "# Code topic\n",
294
+ "topic = 'The full text of the poem \"The Road Not Taken\" by Robert Frost'\n",
295
+ "\n",
296
+ "# set learning objectives if desired\n",
297
+ "learning_objs = (\"\"\"1. Identify the key elements of the work: important takeaways and underlying message.\n",
298
+ " 2. Understand the literary devices used in prompting and in literature and their purpose.\"\"\")"
299
+ ]
300
+ },
301
+ {
302
+ "cell_type": "markdown",
303
+ "metadata": {},
304
+ "source": [
305
+ "## Types of Questions and Prompts\n",
306
+ "\n",
307
+ "Below is a comprehensive list of question types and prompt templates designed by our team. There are also example code blocks, where you can see how the model performed with the example and try it for yourself using the prompt template."
308
+ ]
309
+ },
310
+ {
311
+ "cell_type": "markdown",
312
+ "metadata": {},
313
+ "source": [
314
+ "### Multiple Choice\n",
315
+ "\n",
316
+ "Prompt: The following text should be used as the basis for the instructions which follow: {context}. Please design a 5 question quiz about {name or reference to context} which reflects the learning objectives: {list of learning objectives}. The questions should be multiple choice. If I get an answer wrong, provide me with an explanation of why it was incorrect, and then give me additional chances to respond until I get the correct choice. Explain why the correct choice is right."
317
+ ]
318
+ },
319
+ {
320
+ "cell_type": "code",
321
+ "execution_count": null,
322
+ "metadata": {},
323
+ "outputs": [
324
+ {
325
+ "name": "stdout",
326
+ "output_type": "stream",
327
+ "text": [
328
+ "Question 1: What is the underlying message of the excerpt?\n",
329
+ "\n",
330
+ "A) The speaker regrets not being able to travel both roads.\n",
331
+ "B) The speaker believes that taking the less traveled road has made a significant impact on their life.\n",
332
+ "C) The speaker is unsure about which road to choose.\n",
333
+ "D) The speaker is fascinated by the beauty of the yellow wood.\n",
334
+ "\n",
335
+ "Please select one of the options (A, B, C, or D) and provide your answer.\n"
336
+ ]
337
+ }
338
+ ],
339
+ "source": [
340
+ "# Multiple choice code example\n",
341
+ "tutor_q = get_tutoring_answer('', qa_chain, assessment_request = SELF_STUDY_DEFAULTS['mc'],\n",
342
+ " learning_objectives = learning_objs, input_kwargs = {'question':topic})\n",
343
+ "\n",
344
+ "print(tutor_q)"
345
+ ]
346
+ },
347
+ {
348
+ "cell_type": "markdown",
349
+ "metadata": {},
350
+ "source": [
351
+ "### Short Answer\n",
352
+ "\n",
353
+ "Prompt: Please design a 5-question quiz about {context} which reflects the learning objectives: {list of learning objectives}. The questions should be short answer. Expect the correct answers to be {anticipated length} long. If I get any part of the answer wrong, provide me with an explanation of why it was incorrect, and then give me additional chances to respond until I get the correct choice."
354
+ ]
355
+ },
356
+ {
357
+ "cell_type": "code",
358
+ "execution_count": null,
359
+ "metadata": {},
360
+ "outputs": [
361
+ {
362
+ "name": "stdout",
363
+ "output_type": "stream",
364
+ "text": [
365
+ "Question 1: What is the underlying message of the poem?\n",
366
+ "\n",
367
+ "Remember to provide your answer in a few sentences.\n"
368
+ ]
369
+ }
370
+ ],
371
+ "source": [
372
+ "# Short answer code example\n",
373
+ "tutor_q = get_tutoring_answer(None, qa_chain, assessment_request = SELF_STUDY_DEFAULTS['short_answer'],\n",
374
+ " learning_objectives = learning_objs, input_kwargs = {'question':topic})\n",
375
+ "\n",
376
+ "print(tutor_q)"
377
+ ]
378
+ },
379
+ {
380
+ "cell_type": "markdown",
381
+ "metadata": {},
382
+ "source": [
383
+ "### Fill-in-the-blank\n",
384
+ "\n",
385
+ "Prompt: Create a 5 question fill in the blank quiz refrencing {context}. The quiz should reflect the learning objectives: {learning objectives}. Please prompt me one question at a time and proceed when I answer correctly. If I answer incorrectly, please explain why my answer is incorrect.\n",
386
+ "\n",
387
+ ":::{.callout-info}\n",
388
+ "In the example below, we omit the `input_kwargs` parameter. This means we'll use the text from `assessment_request` as the question topic.\n",
389
+ ":::"
390
+ ]
391
+ },
392
+ {
393
+ "cell_type": "code",
394
+ "execution_count": null,
395
+ "metadata": {},
396
+ "outputs": [
397
+ {
398
+ "name": "stdout",
399
+ "output_type": "stream",
400
+ "text": [
401
+ "Question: The speaker in the poem \"The Road Not Taken\" is faced with a choice between _______ roads.\n",
402
+ "\n",
403
+ "Please provide your answer.\n"
404
+ ]
405
+ }
406
+ ],
407
+ "source": [
408
+ "# Fill in the blank code example\n",
409
+ "tutor_q = get_tutoring_answer(None, qa_chain, assessment_request = SELF_STUDY_DEFAULTS['fill_blank'],\n",
410
+ " learning_objectives = learning_objs)\n",
411
+ "\n",
412
+ "print(tutor_q)"
413
+ ]
414
+ },
415
+ {
416
+ "cell_type": "markdown",
417
+ "metadata": {},
418
+ "source": [
419
+ "### Sequencing\n",
420
+ "\n",
421
+ "Prompt: Please develop a 5 question questionnaire that will ask me to recall the steps involved in the following learning objectives in regard to {context}: {learning objectives}. After I respond, explain their sequence to me."
422
+ ]
423
+ },
424
+ {
425
+ "cell_type": "code",
426
+ "execution_count": null,
427
+ "metadata": {},
428
+ "outputs": [
429
+ {
430
+ "name": "stdout",
431
+ "output_type": "stream",
432
+ "text": [
433
+ "Question 1: What is the underlying message or theme of the provided text?\n",
434
+ "\n",
435
+ "(Note: Please provide your response and I will evaluate it.)\n"
436
+ ]
437
+ }
438
+ ],
439
+ "source": [
440
+ "# Sequence example\n",
441
+ "tutor_q = get_tutoring_answer(None, qa_chain, assessment_request = SELF_STUDY_DEFAULTS['sequencing'],\n",
442
+ " learning_objectives = learning_objs)\n",
443
+ "\n",
444
+ "print(tutor_q)"
445
+ ]
446
+ },
447
+ {
448
+ "cell_type": "markdown",
449
+ "metadata": {},
450
+ "source": [
451
+ "### Relationships/drawing connections\n",
452
+ "\n",
453
+ "Prompt: Please design a 5 question quiz that asks me to explain the relationships that exist within the following learning objectives, referencing {context}: {learning objectives}."
454
+ ]
455
+ },
456
+ {
457
+ "cell_type": "code",
458
+ "execution_count": null,
459
+ "metadata": {},
460
+ "outputs": [
461
+ {
462
+ "name": "stdout",
463
+ "output_type": "stream",
464
+ "text": [
465
+ "Question 1: What is the underlying message or theme of the text \"The Road Not Taken\"?\n",
466
+ "\n",
467
+ "(Note: The answer to this question will require the student to identify the key elements and important takeaways from the text in order to determine the underlying message or theme.)\n"
468
+ ]
469
+ }
470
+ ],
471
+ "source": [
472
+ "# Relationships example\n",
473
+ "tutor_q = get_tutoring_answer(None, qa_chain, assessment_request = SELF_STUDY_DEFAULTS['relationships'],\n",
474
+ " learning_objectives = learning_objs)\n",
475
+ "\n",
476
+ "print(tutor_q)"
477
+ ]
478
+ },
479
+ {
480
+ "cell_type": "markdown",
481
+ "metadata": {},
482
+ "source": [
483
+ "### Concepts and Definitions\n",
484
+ "\n",
485
+ "Prompt: Design a 5 question quiz that asks me about definitions related to the following learning objectives: {learning objectives} - based on {context}\".\n",
486
+ "Once I write out my response, provide me with your own response, highlighting why my answer is correct or incorrect."
487
+ ]
488
+ },
489
+ {
490
+ "cell_type": "code",
491
+ "execution_count": null,
492
+ "metadata": {},
493
+ "outputs": [
494
+ {
495
+ "name": "stdout",
496
+ "output_type": "stream",
497
+ "text": [
498
+ "Question 1: Based on the provided text, what is the underlying message or theme of the work?\n",
499
+ "\n",
500
+ "Please provide your response.\n"
501
+ ]
502
+ }
503
+ ],
504
+ "source": [
505
+ "# Concepts and definitions example\n",
506
+ "tutor_q = get_tutoring_answer(None, qa_chain, assessment_request = SELF_STUDY_DEFAULTS['concepts'],\n",
507
+ " learning_objectives = learning_objs)\n",
508
+ "\n",
509
+ "print(tutor_q)"
510
+ ]
511
+ },
512
+ {
513
+ "cell_type": "markdown",
514
+ "metadata": {},
515
+ "source": [
516
+ "### Real Word Examples\n",
517
+ "\n",
518
+ "Prompt: Demonstrate how {context} can be applied to solve a real-world problem related to the following learning objectives: {learning objectives}. Ask me questions regarding this theory/concept."
519
+ ]
520
+ },
521
+ {
522
+ "cell_type": "code",
523
+ "execution_count": null,
524
+ "metadata": {},
525
+ "outputs": [
526
+ {
527
+ "name": "stdout",
528
+ "output_type": "stream",
529
+ "text": [
530
+ "Based on the provided context, it seems that the extracted text is a poem by Robert Frost and does not directly provide any information or context related to problem-solving in the real world. Therefore, it may not be possible to demonstrate how the provided context can be applied to solve a real-world problem. However, I can still assess your understanding of the learning objectives mentioned. Let's start with the first learning objective: identifying the key elements of the work, important takeaways, and underlying message. \n",
531
+ "\n",
532
+ "Question 1: Based on your reading of the poem, what are some key elements or important takeaways that you can identify?\n"
533
+ ]
534
+ }
535
+ ],
536
+ "source": [
537
+ "# Real word example\n",
538
+ "tutor_q = get_tutoring_answer(None, qa_chain, assessment_request = SELF_STUDY_DEFAULTS['real_world_example'],\n",
539
+ " learning_objectives = learning_objs)\n",
540
+ "\n",
541
+ "print(tutor_q)"
542
+ ]
543
+ },
544
+ {
545
+ "cell_type": "markdown",
546
+ "metadata": {},
547
+ "source": [
548
+ "### Randomized Question Types\n",
549
+ "\n",
550
+ "Prompt: Please generate a high-quality assessment consisting of 5 varying questions, each of different types (open-ended, multiple choice, etc.), to determine if I achieved the following learning objectives in regards to {context}: {learning objectives}. If I answer incorrectly for any of the questions, please explain why my answer is incorrect."
551
+ ]
552
+ },
553
+ {
554
+ "cell_type": "code",
555
+ "execution_count": null,
556
+ "metadata": {},
557
+ "outputs": [
558
+ {
559
+ "name": "stdout",
560
+ "output_type": "stream",
561
+ "text": [
562
+ "Question 1 (Open-ended):\n",
563
+ "Based on the given excerpt, what do you think is the underlying message or theme of the text? Please provide a brief explanation to support your answer.\n",
564
+ "\n",
565
+ "(Note: The answer to this question will vary depending on the student's interpretation of the text. As the tutor, you can provide feedback on the strengths and weaknesses of their response, and guide them towards a deeper understanding of the text's message.)\n"
566
+ ]
567
+ }
568
+ ],
569
+ "source": [
570
+ "# Randomized question types\n",
571
+ "tutor_q = get_tutoring_answer(None, qa_chain, assessment_request = SELF_STUDY_DEFAULTS['randomized_questions'],\n",
572
+ " learning_objectives = learning_objs)\n",
573
+ "\n",
574
+ "print(tutor_q)"
575
+ ]
576
+ },
577
+ {
578
+ "cell_type": "markdown",
579
+ "metadata": {},
580
+ "source": [
581
+ "### Quantiative evaluation the correctness of a student's answer\n",
582
+ "\n",
583
+ "Prompt: (A continuation of the previous chat) Please generate the main points of the student’s answer to the previous question, and evaluate on a scale of 1 to 5 how comprehensive the student’s answer was in relation to the learning objectives, and explain why he or she received this rating, including what was missed in his or her answer if the student’s answer wasn’t complete.\n"
584
+ ]
585
+ },
586
+ {
587
+ "cell_type": "code",
588
+ "execution_count": null,
589
+ "metadata": {},
590
+ "outputs": [
591
+ {
592
+ "name": "stdout",
593
+ "output_type": "stream",
594
+ "text": [
595
+ "Main points of the student's answer:\n",
596
+ "- The underlying message of the text is that people should follow the crowd and take the easy way instead of the road less traveled.\n",
597
+ "- The road less traveled is hard and painful to traverse.\n",
598
+ "\n",
599
+ "Evaluation of the student's answer:\n",
600
+ "I would rate the student's answer a 2 out of 5 in terms of comprehensiveness in relation to the learning objectives. \n",
601
+ "\n",
602
+ "Explanation:\n",
603
+ "The student correctly identifies that the underlying message of the text is related to choosing between two paths, but their interpretation of the message is not entirely accurate. The student suggests that the text encourages people to follow the crowd and take the easy way, which is not supported by the actual message of the poem. The poem actually suggests that taking the road less traveled can make a significant difference in one's life. The student also mentions that the road less traveled is hard and painful to traverse, which is not explicitly stated in the text. This interpretation may be influenced by the student's personal perspective rather than the actual content of the poem. Therefore, the student's answer is not complete and does not fully grasp the intended message of the text.\n"
604
+ ]
605
+ }
606
+ ],
607
+ "source": [
608
+ "# qualitative evaluation\n",
609
+ "qualitative_query = \"\"\" Please generate the main points of the student’s answer to the previous question,\n",
610
+ " and evaluate on a scale of 1 to 5 how comprehensive the student’s answer was in relation to the learning objectives,\n",
611
+ " and explain why he or she received this rating, including what was missed in his or her answer if the student’s answer wasn’t complete.\"\"\"\n",
612
+ "\n",
613
+ "last_answer = (\"TUTOR QUESTION: Question 1 (Open-ended): \" +\n",
614
+ " \"Based on the given excerpt, what do you think is the underlying message or theme of the text? Please provide a \" + \n",
615
+ " \"brief explanation to support your answer.\\n\" + \n",
616
+ " \"STUDENT ANSWER: The underlying message of the text is that people should follow the crowd and the road less traveled is hard \"+\n",
617
+ " \"and painful to traverse. Take the easy way instead. \")\n",
618
+ "\n",
619
+ "# Note that this uses the previous result and query in the context\n",
620
+ "tutor_q = get_tutoring_answer(None, qa_chain, assessment_request = qualitative_query + '\\n' + last_answer,\n",
621
+ " learning_objectives = learning_objs,\n",
622
+ " input_kwargs = {'question':topic})\n",
623
+ "\n",
624
+ "print(tutor_q)"
625
+ ]
626
+ }
627
+ ],
628
+ "metadata": {
629
+ "kernelspec": {
630
+ "display_name": "python3",
631
+ "language": "python",
632
+ "name": "python3"
633
+ }
634
+ },
635
+ "nbformat": 4,
636
+ "nbformat_minor": 0
637
+ }
prompt_with_vector_store_w_grading_intr.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
settings.ini ADDED
@@ -0,0 +1,43 @@
1
+ [DEFAULT]
2
+ # All sections below are required unless otherwise specified.
3
+ # See https://github.com/fastai/nbdev/blob/master/settings.ini for examples.
4
+
5
+ ### Python library ###
6
+ repo = lo-achievement
7
+ lib_name = ai_classroom_suite
8
+ version = 0.0.1
9
+ min_python = 3.7
10
+ license = apache2
11
+ black_formatting = False
12
+
13
+ ### nbdev ###
14
+ doc_path = _docs
15
+ lib_path = ai_classroom_suite
16
+ nbs_path = nbs
17
+ recursive = True
18
+ tst_flags = notest
19
+ put_version_in_init = True
20
+
21
+ ### Docs ###
22
+ branch = main
23
+ custom_sidebar = False
24
+ doc_host = https://%(user)s.github.io
25
+ doc_baseurl = /%(repo)s
26
+ git_url = https://github.com/%(user)s/%(repo)s
27
+ title = %(lib_name)s
28
+
29
+ ### PyPI ###
30
+ audience = Developers
31
+ author = Charreau Bell
32
+ author_email = charreau.s.bell@vanderbilt.edu
33
+ copyright = 2023 onwards, %(author)s
34
+ description = A repository supporting enhanced instruction and grading using AI
35
+ keywords = nbdev jupyter notebook python
36
+ language = English
37
+ status = 3
38
+ user = vanderbilt-data-science
39
+
40
+ ### Optional ###
41
+ requirements = langchain pandas numpy openai gradio chromadb tiktoken unstructured pdf2image yt_dlp libmagic librosa deeplake ipyfilechooser
42
+ # dev_requirements =
43
+ # console_scripts =
setup.py ADDED
@@ -0,0 +1,57 @@
1
+ from pkg_resources import parse_version
2
+ from configparser import ConfigParser
3
+ import setuptools, shlex
4
+ assert parse_version(setuptools.__version__)>=parse_version('36.2')
5
+
6
+ # note: all settings are in settings.ini; edit there, not here
7
+ config = ConfigParser(delimiters=['='])
8
+ config.read('settings.ini', encoding='utf-8')
9
+ cfg = config['DEFAULT']
10
+
11
+ cfg_keys = 'version description keywords author author_email'.split()
12
+ expected = cfg_keys + "lib_name user branch license status min_python audience language".split()
13
+ for o in expected: assert o in cfg, "missing expected setting: {}".format(o)
14
+ setup_cfg = {o:cfg[o] for o in cfg_keys}
15
+
16
+ licenses = {
17
+ 'apache2': ('Apache Software License 2.0','OSI Approved :: Apache Software License'),
18
+ 'mit': ('MIT License', 'OSI Approved :: MIT License'),
19
+ 'gpl2': ('GNU General Public License v2', 'OSI Approved :: GNU General Public License v2 (GPLv2)'),
20
+ 'gpl3': ('GNU General Public License v3', 'OSI Approved :: GNU General Public License v3 (GPLv3)'),
21
+ 'bsd3': ('BSD License', 'OSI Approved :: BSD License'),
22
+ }
23
+ statuses = [ '1 - Planning', '2 - Pre-Alpha', '3 - Alpha',
24
+ '4 - Beta', '5 - Production/Stable', '6 - Mature', '7 - Inactive' ]
25
+ py_versions = '3.6 3.7 3.8 3.9 3.10'.split()
26
+
27
+ requirements = shlex.split(cfg.get('requirements', ''))
28
+ if cfg.get('pip_requirements'): requirements += shlex.split(cfg.get('pip_requirements', ''))
29
+ min_python = cfg['min_python']
30
+ lic = licenses.get(cfg['license'].lower(), (cfg['license'], None))
31
+ dev_requirements = (cfg.get('dev_requirements') or '').split()
32
+
33
+ setuptools.setup(
34
+ name = cfg['lib_name'],
35
+ license = lic[0],
36
+ classifiers = [
37
+ 'Development Status :: ' + statuses[int(cfg['status'])],
38
+ 'Intended Audience :: ' + cfg['audience'].title(),
39
+ 'Natural Language :: ' + cfg['language'].title(),
40
+ ] + ['Programming Language :: Python :: '+o for o in py_versions[py_versions.index(min_python):]] + (['License :: ' + lic[1] ] if lic[1] else []),
41
+ url = cfg['git_url'],
42
+ packages = setuptools.find_packages(),
43
+ include_package_data = True,
44
+ install_requires = requirements,
45
+ extras_require={ 'dev': dev_requirements },
46
+ dependency_links = cfg.get('dep_links','').split(),
47
+ python_requires = '>=' + cfg['min_python'],
48
+ long_description = open('README.md', encoding='utf-8').read(),
49
+ long_description_content_type = 'text/markdown',
50
+ zip_safe = False,
51
+ entry_points = {
52
+ 'console_scripts': cfg.get('console_scripts','').split(),
53
+ 'nbdev': [f'{cfg.get("lib_path")}={cfg.get("lib_path")}._modidx:d']
54
+ },
55
+ **setup_cfg)
56
+
57
+
speech_to_text_models.ipynb ADDED
The diff for this file is too large to render. See raw diff