srikanth-nm committed
Commit b30ed6a · 1 Parent(s): 557a8c5

Upload 19 files

SOURCE_DOCUMENTS/confluence.txt ADDED
@@ -0,0 +1 @@
+ India's Income Tax Laws are framed by the Government. The Government imposes a tax on the taxable income of all persons who are individuals, Hindu Undivided Families (HUFs), companies, firms, LLPs, associations of persons, bodies of individuals, local authorities and any other artificial juridical persons. According to these laws, the levy of tax on a person depends upon his or her residential status. Every individual who qualifies as a resident of India is required to pay tax on his or her global income. Every financial year, taxpayers have to follow certain rules while filing their Income Tax Returns (ITRs). Income Tax Return - What is it? An Income Tax Return (ITR) is a form used to file information about your income and tax with the Income Tax Department. The tax liability of a taxpayer is calculated based on his or her income. If the return shows that excess tax has been paid during a year, the individual will be eligible to receive an income tax refund from the Income Tax Department. As per the income tax laws, the return must be filed every year by an individual or business that earns any income during a financial year. The income could be in the form of a salary, business profits, income from house property, or earnings from dividends, capital gains, interest or other sources. Tax returns have to be filed by an individual or a business before a specified date. If a taxpayer fails to abide by the deadline, he or she has to pay a penalty. Is it mandatory to file an Income Tax Return? As per the tax laws laid down in India, it is compulsory to file your income tax returns if your income is more than the basic exemption limit. The income tax rate is pre-decided for taxpayers. A delay in filing returns will not only attract late filing fees but also hamper your chances of getting a loan or a visa for travel purposes. Who should file Income Tax Returns? According to the Income Tax Act, income tax has to be paid only by individuals or businesses who fall within certain income brackets.
SOURCE_DOCUMENTS/transcript.txt ADDED
@@ -0,0 +1 @@
+ In this video, I'm going to answer the top 3 questions my students ask me about Python. What is Python? What can you do with it? And why is it so popular? In other words, what does it do that other programming languages don't? Python is the world's fastest growing and most popular programming language, not just amongst software engineers, but also amongst mathematicians, data analysts, scientists, accountants, networking engineers, and even kids! Because it's a very beginner-friendly programming language. So people from different disciplines use Python for a variety of different tasks, such as data analysis and visualization, artificial intelligence and machine learning, automation. In fact, this is one of the big uses of Python amongst people who are not software developers. If you constantly have to do boring, repetitive tasks, such as copying files and folders around, renaming them, uploading them to a server, you can easily write a Python script to automate all that and save your time. And that's just one example. If you continuously have to work with Excel spreadsheets, PDFs, CSV files, download websites and parse them, you can automate all that stuff with Python. So you don't have to be a software developer to use Python. You could be an accountant, a mathematician, or a scientist, and use Python to make your life easier. You can also use Python to build web, mobile and desktop applications as well as software testing or even hacking. So Python is a multi-purpose language. Now if you have some programming experience you may say, "But Mosh, we can do all this stuff with other programming languages, so what's the big deal about Python?" Here are a few reasons. With Python you can solve complex problems in less time with fewer lines of code. Here's an example. Let's say we want to extract the first three letters of the text Hello World. This is the code we have to write in C#, this is how we do it in JavaScript, and here's how we do it in Python. See how short and clean the language is? And that's just the beginning. Python makes a lot of trivial things really easy with a simple yet powerful syntax. Here are a few other reasons Python is so popular. It's a high-level language so you don't have to worry about complex tasks such as memory management, like you do in C++. It's cross-platform which means you can build and run Python applications on Windows, Mac, and Linux. It has a huge community so whenever you get stuck, there is someone out there to help. It has a large ecosystem of libraries, frameworks and tools which means whatever you wanna do, it is likely that someone else has done it before because Python has been around for over 20 years. So in a nutshell, Python is a multi-purpose language with a simple, clean, and beginner-friendly syntax. All of that means Python is awesome. Technically everything you do with Python you can do with other programming languages, but Python's simplicity and elegance has made it grow way more than other programming languages. That's why it's the number one language employers are looking for. So whether you're a programmer or an absolute beginner, learning Python opens up lots of job opportunities to you. In fact, the average Python developer earns a whopping 116,000 dollars a year. If you found this video helpful, please support my hard work by liking and sharing it with others. Also, be sure to subscribe to my channel, because I have a couple of awesome Python tutorials for you, you're going to see them on the screen now. Here's my Python tutorial for beginners, it's a great starting point if you have limited or no programming experience. On the other hand, if you do have some programming experience and want to quickly get up to speed with Python, I have another tutorial just for you. I'm not going to waste your time telling you what a variable or a function is. I will talk to you like a programmer. There's never been a better time to master Python programming, so click on the tutorial that is right for you and get started. Thank you for watching!
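The transcript refers to on-screen code for extracting the first three letters of "Hello World", but that code is not captured in the text. A minimal Python sketch of what the speaker is most likely describing (string slicing; the variable names are illustrative) would be:

```python
# Take the first three characters of a string with slice notation.
text = "Hello World"
first_three = text[:3]  # characters at index 0, 1 and 2
print(first_three)
```

The slice `text[:3]` stops before index 3, so no explicit length arithmetic is needed.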
__init__.py ADDED
File without changes
app.py ADDED
@@ -0,0 +1,216 @@
+ # Import required libraries
+ import streamlit as st
+ import youtube
+ import confluence
+ import modJira
+ import time
+ import similarity
+ import ingest
+
+ # global transcript_result
+ # transcript_result = ""
+
+ # Set page configuration and title for Streamlit
+ st.set_page_config(page_title="AI-Seeker", page_icon="📼", layout="wide")
+
+ # Add header with title and description
+ st.markdown(
+     '<p style="display:inline-block;font-size:40px;font-weight:bold;">AI-Seeker</p> <p style="display:inline-block;font-size:16px;">AI-Seeker is a web-app tool that utilizes APIs to extract text content from YouTube, Confluence and Jira. It incorporates Llama-2-7B-Chat-GGML model with Langchain to provide users with a summary and query-based smart response depending on the content of the media source.<br><br></p>',
+     unsafe_allow_html=True
+ )
+
+ txtInputBox = "YouTube"
+
+
+ with st.sidebar.title("Configuration"):
+     usecase = st.sidebar.selectbox("Select Media Type:",("YouTube", "Confluence", "Jira"))
+     if usecase == "YouTube":
+         txtInputBox = "Enter ID of YouTube Video"
+         default_value = "Y8Tko2YC5hA"
+     elif usecase == "Confluence":
+         txtInputBox = "Enter ID of your Confluence Page"
+         default_value = "393217"
+     elif usecase == "Jira":
+         txtInputBox = "Enter the name of your JIRA Project"
+         default_value = "jira_test"
+
+ video_id = st.sidebar.text_input(txtInputBox,value=default_value)
+
+ strTranscript = ""
+ training_status = "yet_to_start"
+ btnTranscript = st.sidebar.button("Transcript")
+ btnSummary = st.sidebar.button("Summary")
+ btnTrain = st.sidebar.button("Train")
+ if btnTrain:
+     with st.spinner("Training in Progress..."):
+         ingest.main()
+
+ query = st.sidebar.text_input('Enter your question below:', value="What is Python?")
+ btnAsk = st.sidebar.button("Query")
+
+ btnClear = st.sidebar.button("Clear Data")
+ if btnClear:
+     st.session_state.clear()
+
+ def fnJira():
+     st.info("Transcription")
+
+     if btnTranscript:
+
+         if 'transcript_result' not in st.session_state:
+             st.session_state['transcript_result'] = modJira.get_details(video_id)
+         transcript_result = st.session_state['transcript_result']
+         st.dataframe(transcript_result)
+     else:
+         if 'transcript_result' in st.session_state:
+             transcript_result = st.session_state['transcript_result']
+             st.dataframe(transcript_result)
+
+     st.info("Query")
+
+     if btnAsk:
+         with st.spinner(text="Retrieving..."):
+             if 'transcript_answer' not in st.session_state:
+                 answer = modJira.ask_question(query)
+                 st.session_state['transcript_answer'] = answer
+                 #st.success(answer)
+             if 'transcript_answer' in st.session_state:
+                 answer = st.session_state['transcript_answer']
+
+         st.success(answer)
+
+     else:
+         if 'transcript_answer' in st.session_state:
+             answer = st.session_state['transcript_answer']
+
+             st.success(answer)
+
+
+ def fnConfluence():
+     st.info("Transcription")
+
+     if btnTranscript:
+
+         if 'transcript_result' not in st.session_state:
+             st.session_state['transcript_result'] = confluence.transcript(video_id)
+         transcript_result = st.session_state['transcript_result']
+         st.markdown(f"<div style='height: 100px; overflow-y: scroll;'>{transcript_result}</div>", unsafe_allow_html=True)
+     else:
+         if 'transcript_result' in st.session_state:
+             transcript_result = st.session_state['transcript_result']
+             st.markdown(f"<div style='height: 100px; overflow-y: scroll;'>{transcript_result}</div>", unsafe_allow_html=True)
+
+     col1, col2 = st.columns([1, 1])
+
+     with col1:
+         # with col12:
+         st.info("Summary")
+         if btnSummary:
+             if 'transcript_summary' not in st.session_state:
+                 with st.spinner(text="Retrieving..."):
+                     st.session_state['transcript_summary'] = confluence.summarize()
+             summary = st.session_state['transcript_summary']
+             st.success(summary)
+         else:
+             if 'transcript_summary' in st.session_state:
+                 summary = st.session_state['transcript_summary']
+                 st.success(summary)
+
+     with col2:
+         st.info("Query")
+
+         if btnAsk:
+             with st.spinner(text="Retrieving..."):
+                 if 'transcript_answer' not in st.session_state:
+                     answer = confluence.ask_question(query)
+                     st.session_state['transcript_answer'] = answer
+                     #st.success(answer)
+                 if 'transcript_answer' in st.session_state:
+                     answer = st.session_state['transcript_answer']
+
+             st.success(answer)
+
+         else:
+             if 'transcript_answer' in st.session_state:
+                 answer = st.session_state['transcript_answer']
+
+                 st.success(answer)
+
+ def fnYoutube():
+     st.info("Transcription")
+
+     if btnTranscript:
+
+         if 'transcript_result' not in st.session_state:
+             st.session_state['transcript_result'] = youtube.audio_to_transcript(video_id)
+         transcript_result = st.session_state['transcript_result']
+         st.markdown(f"<div style='height: 100px; overflow-y: scroll;'>{transcript_result}</div>", unsafe_allow_html=True)
+     else:
+         if 'transcript_result' in st.session_state:
+             transcript_result = st.session_state['transcript_result']
+             st.markdown(f"<div style='height: 100px; overflow-y: scroll;'>{transcript_result}</div>", unsafe_allow_html=True)
+
+     col1, col2 = st.columns([1, 1])
+
+     with col1:
+         # with col12:
+         st.info("Summary")
+         if btnSummary:
+             if 'transcript_summary' not in st.session_state:
+                 with st.spinner(text="Retrieving..."):
+                     st.session_state['transcript_summary'] = youtube.summarize()
+             summary = st.session_state['transcript_summary']
+             st.success(summary)
+         else:
+             if 'transcript_summary' in st.session_state:
+                 summary = st.session_state['transcript_summary']
+                 st.success(summary)
+
+     with col2:
+         st.info("Query")
+
+         if btnAsk:
+             with st.spinner(text="Retrieving..."):
+                 if 'transcript_answer' not in st.session_state:
+                     answer = youtube.ask_question(query)
+                     st.session_state['transcript_answer'] = answer
+                     #st.success(answer)
+                 if 'transcript_answer' in st.session_state:
+                     answer = st.session_state['transcript_answer']
+
+             st.success(answer)
+
+             transcript_start_time, transcript_end_time = similarity.similarity(strQuery=answer)
+
+             st.video(f"https://www.youtube.com/embed/{video_id}", format="video/mp4", start_time=int(transcript_start_time))
+
+         else:
+             if 'transcript_answer' in st.session_state:
+                 answer = st.session_state['transcript_answer']
+
+                 st.success(answer)
+
+                 transcript_start_time, transcript_end_time = similarity.similarity(strQuery=answer)
+
+                 st.video(f"https://www.youtube.com/embed/{video_id}", format="video/mp4", start_time=int(transcript_start_time))
+
+ if usecase == "YouTube":
+     fnYoutube()
+ elif usecase == "Confluence":
+     fnConfluence()
+ elif usecase == "Jira":
+     fnJira()
+
+ # Hide Streamlit header, footer, and menu
+ hide_st_style = """
+ <style>
+ #MainMenu {visibility: hidden;}
+ footer {visibility: hidden;}
+ header {visibility: hidden;}
+ </style>
+ """
+ #"""footer {visibility: hidden;}
+ # header {visibility: hidden;}"""
+
+ # Apply CSS code to hide header, footer, and menu
+ st.markdown(hide_st_style, unsafe_allow_html=True)
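Each of the three handlers in app.py stores its transcript, summary and answer in `st.session_state` so the values survive Streamlit reruns. The pattern can be sketched without Streamlit by letting a plain dict stand in for `st.session_state`. The helper name `cached_call` and its `refresh` flag are hypothetical and not part of the commit; as committed, the app keeps returning the first cached answer until "Clear Data" is pressed.

```python
def cached_call(state, key, compute, refresh=False):
    """Return state[key], computing it first if it is missing or refresh is set."""
    if refresh or key not in state:
        state[key] = compute()
    return state[key]


calls = []

def fake_answer():
    # Stand-in for an expensive call such as youtube.ask_question(query).
    calls.append(1)
    return f"answer #{len(calls)}"


session_state = {}  # plain dict standing in for st.session_state
first = cached_call(session_state, "transcript_answer", fake_answer)
second = cached_call(session_state, "transcript_answer", fake_answer)  # served from cache
third = cached_call(session_state, "transcript_answer", fake_answer, refresh=True)  # recomputed
```

Passing `refresh=True` when the question text changes would be one way to avoid showing a stale answer, at the cost of another model call.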
chunks.json ADDED
@@ -0,0 +1 @@
+ [{"text": "In this video, I'm going to answer the top 3 questions my students ask me about Python. What is Python? What can you do with it? And why is it so popular? In other words, what does it do that other programming languages don't? Python is the ", "start": 0.0, "end": 16.0}, {"text": "world's fastest growing and most popular programming language, not just amongst software engineers, but also amongst mathematicians, data analysts, scientists, accountants, networking engineers, and even kids! Because it's a very beginner-friendly programming ", "start": 16.0, "end": 32.0}, {"text": "language. So people from different disciplines use Python for a variety of different tasks, such as data analysis and visualization, artificial intelligence and machine learning, automation. In fact, this is one of the big uses of Python amongst people who are not software", "start": 32.0, "end": 48.0}, {"text": "developers. If you constantly have to do boring, repetitive tasks, such as copying files and folders around, renaming them, uploading them to a server, you can easily write a Python script to automate all that and save your time. And that's just one example. If you", "start": 48.0, "end": 64.0}, {"text": "continuously have to work with Excel spreadsheets, PDFs, CSV files, download websites and parse them, you can automate all that stuff with Python. So you don't have to be a software developer to use Python. You could be an accountant, a mathematician, or a scientist, and use Python ", "start": 64.0, "end": 80.0}, {"text": "to make your life easier. You can also use Python to build web, mobile and desktop applications as well as software testing or even hacking. So Python is a multi-purpose language. Now if you have some programming experience you may say, \"But Mosh,", "start": 80.0, "end": 96.0}, {"text": "we can do all this stuff with other programming languages, so what's the big deal about Python?\" Here are a few reasons. With Python you can solve complex problems in less time with fewer lines of code. Here's an example. Let's say we want to extract the first three ", "start": 96.0, "end": 112.0}, {"text": "letters of the text Hello World. This is the code we have to write in C#, this is how we do it in JavaScript, and here's how we do it in Python. See how short and clean the language is? And that's just the beginning. Python makes a lot of trivial things", "start": 112.0, "end": 128.0}, {"text": "really easy with a simple yet powerful syntax. Here are a few other reasons Python is so popular. It's a high-level language so you don't have to worry about complex tasks such as memory management, like you do in C++. It's cross-platform which means ", "start": 128.0, "end": 144.0}, {"text": "you can build and run Python applications on Windows, Mac, and Linux. It has a huge community so whenever you get stuck, there is someone out there to help. It has a large ecosystem of libraries, frameworks and tools which means whatever you wanna do,", "start": 144.0, "end": 160.0}, {"text": "it is likely that someone else has done it before because Python has been around for over 20 years. So in a nutshell, Python is a multi-purpose language with a simple, clean, and beginner-friendly syntax. All of that means Python is awesome.", "start": 160.0, "end": 176.0}, {"text": "Technically everything you do with Python you can do with other programming languages, but Python's simplicity and elegance has made it grow way more than other programming languages. That's why it's the number one language employers are looking for. So whether you're a programmer or ", "start": 176.0, "end": 192.0}, {"text": "an absolute beginner, learning Python opens up lots of job opportunities to you. In fact, the average Python developer earns a whopping 116,000 dollars a year. If you found this video helpful, please support my hard work by liking and sharing it with others. ", "start": 192.0, "end": 208.0}, {"text": "Also, be sure to subscribe to my channel, because I have a couple of awesome Python tutorials for you, you're going to see them on the screen now. Here's my Python tutorial for beginners, it's a great starting point if you have limited or no programming experience. On the other hand, if you ", "start": 208.0, "end": 224.0}, {"text": "do have some programming experience and want to quickly get up to speed with Python, I have another tutorial just for you. I'm not going to waste your time telling you what a variable or a function is. I will talk to you like a programmer. There's never been a better time to master Python programming,", "start": 224.0, "end": 240.0}, {"text": "so click on the tutorial that is right for you and get started. Thank you for watching!", "start": 240.0, "end": 246.63}]
chunks_create.py ADDED
@@ -0,0 +1,44 @@
+ import json
+
+ def combine_and_calculate(input_file_path, output_file_path):
+     with open(input_file_path, 'r') as file:
+         output_data = json.load(file)
+
+     combined_json_list = []
+
+     # Calculate the number of groups to create
+     num_groups = (len(output_data) + 7) // 8
+
+     for group_num in range(num_groups):
+         # Calculate the starting index and ending index for the current group
+         start_index = group_num * 8
+         end_index = min(start_index + 8, len(output_data))
+
+         # Extract the "text" values from the current group of dictionaries
+         combined_text = " ".join([item["text"] for item in output_data[start_index:end_index]])
+
+         # Calculate the "start" and "end" for the current group
+         group_start = output_data[start_index]["start"]
+         group_end = output_data[end_index - 1]["end"]
+
+         # Create the combined JSON for the current group
+         combined_json = {
+             "text": combined_text,
+             "start": group_start,
+             "end": group_end,
+         }
+
+         combined_json_list.append(combined_json)
+
+     # Save the combined JSON list to a new file
+     with open(output_file_path, 'w') as output_file:
+         json.dump(combined_json_list, output_file)
+
+ # Replace 'output_file.json' with the path to the output JSON file you created previously
+ input_file_path = '/home/bharathi/langchain_experiments/GenAI/transcript_end.json'
+
+ # Replace 'combined_output_file.json' with the desired path and filename for the combined JSON file
+ output_file_path = '/home/bharathi/langchain_experiments/GenAI/chunks.json'
+
+ # Call the function to create the combined JSON and save it to a new file
+ combine_and_calculate(input_file_path, output_file_path)
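The grouping step in chunks_create.py merges every eight transcript segments into one chunk and keeps the first start and last end time. A minimal in-memory sketch of the same idea follows, without file I/O. The function name `combine_segments`, the `group_size` parameter and the sample segments are illustrative, not taken from the commit.

```python
def combine_segments(segments, group_size=8):
    """Merge consecutive segments into chunks of at most group_size items."""
    combined = []
    for i in range(0, len(segments), group_size):
        group = segments[i:i + group_size]
        combined.append({
            "text": " ".join(s["text"] for s in group),
            "start": group[0]["start"],  # start of the first segment in the group
            "end": group[-1]["end"],     # end of the last segment in the group
        })
    return combined


# Ten made-up two-second segments: s0 covers 0-2 s, s1 covers 2-4 s, and so on.
segments = [
    {"text": f"s{i}", "start": float(2 * i), "end": float(2 * i + 2)}
    for i in range(10)
]
chunks = combine_segments(segments)
```

Stepping `range` by the group size makes the last, shorter group fall out naturally, so the `(len + 7) // 8` arithmetic is not needed.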
confluence.py ADDED
@@ -0,0 +1,75 @@
+ import os
+ from bs4 import BeautifulSoup
+ import requests
+ from requests.auth import HTTPBasicAuth
+ import ingest
+ import run_localGPT
+
+ def start_training():
+     training_status = ingest.main()
+     return training_status
+
+ def replace_substring_and_following(input_string, substring):
+     index = input_string.find(substring)
+     if index != -1:
+         return input_string[:index]
+     else:
+         return input_string
+
+ def ask_question(strQuestion):
+     answer = run_localGPT.main(device_type='cpu', strQuery=strQuestion)
+     answer_cleaned = replace_substring_and_following(answer, "Unhelpful Answer")
+     return answer_cleaned
+
+ def transcript(page_id):
+
+     url = f"https://srikanthnm.atlassian.net/wiki/rest/api/content/{page_id}?expand=body.storage" # Replace with the actual URL you want to access
+     username = os.environ["ATLASSIAN_EMAIL"]  # assumed variable name; credentials are read from the environment instead of being hard-coded
+     password = os.environ["ATLASSIAN_API_TOKEN"]  # assumed variable name; holds the Atlassian API token
+
+     response = requests.get(url, auth=HTTPBasicAuth(username, password))
+
+     # Check if the request was successful (status code 200)
+     if response.status_code == 200:
+         # Process the response data (if applicable)
+         data = response.json()
+     else:
+         return f"Error: Unable to access the URL. Status code: {response.status_code}"
+
+     soup = BeautifulSoup(data['body']['storage']['value'],"html.parser")
+
+     page_content = soup.get_text()
+     page_content_cleaned = page_content.replace('\xa0',' ')
+
+     with open('SOURCE_DOCUMENTS/confluence.txt', 'w') as outfile:
+         outfile.write(page_content_cleaned[:1998])
+
+     return page_content_cleaned[:1998]
+
+ def summarize():
+     from langchain import PromptTemplate, LLMChain
+     from langchain.text_splitter import CharacterTextSplitter
+     from langchain.chains.mapreduce import MapReduceChain
+     from langchain.prompts import PromptTemplate
+
+     model_id = "TheBloke/Llama-2-7B-Chat-GGML"
+     model_basename = "llama-2-7b-chat.ggmlv3.q4_0.bin"
+
+     llm = run_localGPT.load_model(device_type='cpu', model_id=model_id, model_basename=model_basename)
+
+     text_splitter = CharacterTextSplitter()
+
+     with open("SOURCE_DOCUMENTS/confluence.txt") as f:
+         file_content = f.read()
+     texts = text_splitter.split_text(file_content)
+
+     from langchain.docstore.document import Document
+
+     docs = [Document(page_content=t) for t in texts]
+
+     from langchain.chains.summarize import load_summarize_chain
+
+     chain = load_summarize_chain(llm, chain_type="map_reduce")
+     summary = chain.run(docs)
+
+     return summary
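The two text-cleanup steps in confluence.py (turning the page's storage-format HTML into plain text, and cutting an answer at the "Unhelpful Answer" marker) can be tried in isolation. The sketch below is a stdlib-only approximation: it uses `html.parser.HTMLParser` as a stand-in for BeautifulSoup's `get_text()`, and the names `TextExtractor`, `html_to_text` and `strip_from_marker` plus the sample inputs are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(storage_html, limit=1998):
    parser = TextExtractor()
    parser.feed(storage_html)
    parser.close()
    # Non-breaking spaces become ordinary spaces, then the text is truncated.
    text = "".join(parser.parts).replace("\xa0", " ")
    return text[:limit]


def strip_from_marker(answer, marker="Unhelpful Answer"):
    # Keep everything before the first occurrence of the marker.
    head, _, _ = answer.partition(marker)
    return head


page_text = html_to_text("<p>Income&nbsp;Tax <b>Return</b></p>")
clean_answer = strip_from_marker("Paris. Unhelpful Answer: London")
```

`str.partition` returns the whole string as `head` when the marker is absent, which matches the behaviour of `replace_substring_and_following`.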
constants.py ADDED
@@ -0,0 +1,42 @@
+ import os
+
+ # from dotenv import load_dotenv
+ from chromadb.config import Settings
+
+ # https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/excel.html?highlight=xlsx#microsoft-excel
+ from langchain.document_loaders import CSVLoader, PDFMinerLoader, TextLoader, UnstructuredExcelLoader, Docx2txtLoader
+
+ # load_dotenv()
+ ROOT_DIRECTORY = os.path.dirname(os.path.realpath(__file__))
+
+ # Define the folder for storing database
+ SOURCE_DIRECTORY = f"{ROOT_DIRECTORY}/SOURCE_DOCUMENTS"
+
+ PERSIST_DIRECTORY = f"{ROOT_DIRECTORY}/DB"
+
+ # Can be changed to a specific number
+ INGEST_THREADS = os.cpu_count() or 8
+
+ # Define the Chroma settings
+ CHROMA_SETTINGS = Settings(
+     chroma_db_impl="duckdb+parquet", persist_directory=PERSIST_DIRECTORY, anonymized_telemetry=False
+ )
+
+ # https://python.langchain.com/en/latest/_modules/langchain/document_loaders/excel.html#UnstructuredExcelLoader
+ DOCUMENT_MAP = {
+     ".txt": TextLoader,
+     ".md": TextLoader,
+     ".py": TextLoader,
+     ".pdf": PDFMinerLoader,
+     ".csv": CSVLoader,
+     ".xls": UnstructuredExcelLoader,
+     ".xlsx": UnstructuredExcelLoader,
+     ".docx": Docx2txtLoader,
+     ".doc": Docx2txtLoader,
+ }
+
+ # Default Instructor Model
+ EMBEDDING_MODEL_NAME = "hkunlp/instructor-large"
+ # You can also choose a smaller model, don't forget to change HuggingFaceInstructEmbeddings
+ # to HuggingFaceEmbeddings in both ingest.py and run_localGPT.py
+ # EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
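`DOCUMENT_MAP` is a dispatch table: the file extension selects the loader class that ingest.py instantiates. A minimal sketch of that lookup follows, with plain functions standing in for the LangChain loader classes. `load_text`, `load_csv`, `LOADERS` and `pick_loader` are hypothetical names, and lowercasing the extension is an addition that the committed code does not do.

```python
import os


def load_text(path):
    # Stand-in for a loader such as TextLoader; only tags the path.
    return f"text:{path}"


def load_csv(path):
    # Stand-in for a loader such as CSVLoader.
    return f"csv:{path}"


LOADERS = {".txt": load_text, ".md": load_text, ".csv": load_csv}


def pick_loader(file_path):
    ext = os.path.splitext(file_path)[1].lower()  # lowercasing is an assumption
    loader = LOADERS.get(ext)
    if loader is None:
        raise ValueError(f"Document type is undefined: {ext!r}")
    return loader


result = pick_loader("SOURCE_DOCUMENTS/transcript.txt")("SOURCE_DOCUMENTS/transcript.txt")
```

Keeping the mapping in one dictionary means a new file type needs one extra entry rather than another `elif` branch.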
end_calculate.py ADDED
@@ -0,0 +1,24 @@
+ import json
+
+ def calculate_end_from_file(input_file_path, output_file_path):
+     with open(input_file_path, 'r') as file:
+         input_data = json.load(file)
+
+     # Iterate through the list of dictionaries and calculate "end" for each one
+     for item in input_data:
+         item["end"] = round(item["start"] + item["duration"], 2)
+         del item["duration"]  # Remove the "duration" key from each dictionary
+
+     # Save the updated data to a new JSON file
+     with open(output_file_path, 'w') as output_file:
+         json.dump(input_data, output_file)
+
+ # # Replace 'input_file.json' with the actual path to your input JSON file
+ # input_file_path = '/home/bharathi/langchain_experiments/GenAI/transcript.json'
+
+ # # Replace 'output_file.json' with the desired path and filename for the output JSON file
+ # output_file_path = '/home/bharathi/langchain_experiments/GenAI/transcript_end.json'
+
+ # # Call the function to calculate the "end" values and remove "duration" and save the new JSON file
+ # calculate_end_from_file(input_file_path, output_file_path)
+
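The conversion in end_calculate.py replaces each segment's `duration` with an absolute `end` time, computed as `start + duration` and rounded to two decimals. A small in-memory sketch of that arithmetic follows. The function name `add_end_times` and the sample segment are illustrative, and unlike the committed code this version builds new dictionaries instead of mutating the input.

```python
def add_end_times(segments):
    """Return new segment dicts with "end" in place of "duration"."""
    result = []
    for seg in segments:
        result.append({
            "text": seg["text"],
            "start": seg["start"],
            "end": round(seg["start"] + seg["duration"], 2),
        })
    return result


# Made-up segment: starts at 240.0 s and lasts 6.63 s.
segments = [{"text": "Thank you for watching!", "start": 240.0, "duration": 6.63}]
with_ends = add_end_times(segments)
```

Rounding keeps floating-point noise from the addition out of the stored timestamps.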
ingest.py ADDED
@@ -0,0 +1,131 @@
+ import logging
+ import os
+ from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
+
+ import click
+ import torch
+ from langchain.docstore.document import Document
+ from langchain.embeddings import HuggingFaceInstructEmbeddings
+ from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
+ from langchain.vectorstores import Chroma
+
+ from constants import (
+     CHROMA_SETTINGS,
+     DOCUMENT_MAP,
+     EMBEDDING_MODEL_NAME,
+     INGEST_THREADS,
+     PERSIST_DIRECTORY,
+     SOURCE_DIRECTORY,
+ )
+
+
+ def load_single_document(file_path: str) -> Document:
+     # Loads a single document from a file path
+     file_extension = os.path.splitext(file_path)[1]
+     loader_class = DOCUMENT_MAP.get(file_extension)
+     if loader_class:
+         loader = loader_class(file_path)
+     else:
+         raise ValueError("Document type is undefined")
+     return loader.load()[0]
+
+
+ def load_document_batch(filepaths):
+     logging.info("Loading document batch")
+     # create a thread pool
+     with ThreadPoolExecutor(len(filepaths)) as exe:
+         # load files
+         futures = [exe.submit(load_single_document, name) for name in filepaths]
+         # collect data
+         data_list = [future.result() for future in futures]
+         # return data and file paths
+         return (data_list, filepaths)
+
+
+ def load_documents(source_dir: str) -> list[Document]:
+     # Loads all documents from the source documents directory
+     all_files = os.listdir(source_dir)
+     paths = []
+     for file_path in all_files:
+         file_extension = os.path.splitext(file_path)[1]
+         source_file_path = os.path.join(source_dir, file_path)
+         if file_extension in DOCUMENT_MAP.keys():
+             paths.append(source_file_path)
+
+     # Have at least one worker and at most INGEST_THREADS workers
+     n_workers = min(INGEST_THREADS, max(len(paths), 1))
+     chunksize = max(round(len(paths) / n_workers), 1)  # at least 1, so range() below never gets a zero step
+     docs = []
+     with ProcessPoolExecutor(n_workers) as executor:
+         futures = []
+         # split the load operations into chunks
+         for i in range(0, len(paths), chunksize):
+             # select a chunk of filenames
+             filepaths = paths[i : (i + chunksize)]
+             # submit the task
+             future = executor.submit(load_document_batch, filepaths)
+             futures.append(future)
+         # process all results
+         for future in as_completed(futures):
+             # open the file and load the data
+             contents, _ = future.result()
+             docs.extend(contents)
+
+     return docs
+
+
+ def split_documents(documents: list[Document]) -> tuple[list[Document], list[Document]]:
+     # Splits documents for correct Text Splitter
+     text_docs, python_docs = [], []
+     for doc in documents:
+         file_extension = os.path.splitext(doc.metadata["source"])[1]
+         if file_extension == ".py":
+             python_docs.append(doc)
+         else:
+             text_docs.append(doc)
+
+     return text_docs, python_docs
+
+ def main():#device_type):
+     # Load documents and split in chunks
+     logging.info(f"Loading documents from {SOURCE_DIRECTORY}")
+     documents = load_documents(SOURCE_DIRECTORY)
+     text_documents, python_documents = split_documents(documents)
+     text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
+     python_splitter = RecursiveCharacterTextSplitter.from_language(
+         language=Language.PYTHON, chunk_size=1000, chunk_overlap=200
+     )
+     texts = text_splitter.split_documents(text_documents)
+     texts.extend(python_splitter.split_documents(python_documents))
+     logging.info(f"Loaded {len(documents)} documents from {SOURCE_DIRECTORY}")
+     logging.info(f"Split into {len(texts)} chunks of text")
+
+     # Create embeddings
+     embeddings = HuggingFaceInstructEmbeddings(
+         model_name=EMBEDDING_MODEL_NAME,
+         model_kwargs={"device": "cpu"},
+     )
+     # change the embedding type here if you are running into issues.
+     # These are much smaller embeddings and will work for most applications
+     # If you use HuggingFaceEmbeddings, make sure to also use the same in the
+     # run_localGPT.py file.
+
+     # embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
+
+     db = Chroma.from_documents(
+         texts,
+         embeddings,
+         persist_directory=PERSIST_DIRECTORY,
+         client_settings=CHROMA_SETTINGS,
+     )
+     db.persist()
+     db = None
+
+     return "done"
+
+
+ if __name__ == "__main__":
+     logging.basicConfig(
+         format="%(asctime)s - %(levelname)s - %(filename)s:%(lineno)s - %(message)s", level=logging.INFO
+     )
+     main()
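`load_documents` splits the list of file paths into batches before handing them to worker processes. The batching arithmetic can be checked on its own, as in the sketch below. The helper name `batch_paths` is hypothetical, and the `max(..., 1)` guard is an assumption about the intended behaviour: without it, an empty source directory yields a chunk size of 0, which `range()` rejects as a step.

```python
def batch_paths(paths, max_workers):
    """Split paths into batches, one batch per submitted task."""
    # At least one worker, at most max_workers.
    n_workers = min(max_workers, max(len(paths), 1))
    # Guard against a zero step when there are no paths at all.
    chunksize = max(round(len(paths) / n_workers), 1)
    return [paths[i:i + chunksize] for i in range(0, len(paths), chunksize)]


four_files = batch_paths(["a.txt", "b.txt", "c.txt", "d.txt"], max_workers=2)
no_files = batch_paths([], max_workers=8)
```

With more workers than files, each file ends up in its own batch, so no worker is handed an empty list.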
jira.csv ADDED
@@ -0,0 +1,4 @@
+ Key,Summary,Assignee
+ JT-3,Youtube querying,Srikanth Murleedharan
+ JT-2,Functionality for Teams transcript,Bharathi Sriram
+ JT-1,Multiple-page UI,Lakshmi Narayanan
modJira.py ADDED
@@ -0,0 +1,49 @@
+ import os
+
+ import pandas as pd
+ from jira.client import JIRA
+
+
+ def get_details(project_id):
+     # Specify the server URL. It should be your Atlassian domain,
+     # e.g. yourdomainname.atlassian.net
+     jiraOptions = {'server': "https://srikanthnm.atlassian.net"}
+
+     # Create a JIRA client with basic authentication (account email + API token).
+     # The credentials are read from the environment so that no secret is stored
+     # in the source; the variable names JIRA_EMAIL and JIRA_API_TOKEN are assumptions.
+     jira = JIRA(options=jiraOptions, basic_auth=(
+         os.environ["JIRA_EMAIL"], os.environ["JIRA_API_TOKEN"]))
+
+     # Collect the key, summary and assignee of the issues filed against the project.
+     lstKeys = []
+     lstSummary = []
+     lstAssignee = []
+     for singleIssue in jira.search_issues(jql_str=f'project = {project_id}'):
+         lstKeys.append(singleIssue.key)
+         lstSummary.append(singleIssue.fields.summary)
+         # Guard against issues that have no assignee object.
+         assignee = singleIssue.fields.assignee
+         lstAssignee.append(assignee.displayName if assignee else "Unassigned")
+
+     df_output = pd.DataFrame()
+     df_output['Key'] = lstKeys
+     df_output['Summary'] = lstSummary
+     df_output['Assignee'] = lstAssignee
+
+     df_output.to_csv('jira.csv', index=False)
+
+     return df_output
+
+
+ def ask_question(strQuery):
+     # Imported lazily because loading transformers is slow.
+     from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+     tokenizer = AutoTokenizer.from_pretrained("Yale-LILY/reastap-large")
+     model = AutoModelForSeq2SeqLM.from_pretrained("Yale-LILY/reastap-large")
+
+     # The table written by get_details is the context for the question.
+     table = pd.read_csv("jira.csv")
+
+     encoding = tokenizer(table=table, query=strQuery, return_tensors="pt")
+     outputs = model.generate(**encoding)
+
+     return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
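The row-building loop in `get_details` can be exercised without a live Jira server. The sketch below uses `SimpleNamespace` objects as stand-ins for issue objects and assumes that an unassigned issue exposes `assignee` as `None`; `issue_to_row` and the sample issue `JT-4` are hypothetical and not part of the commit.

```python
from types import SimpleNamespace


def issue_to_row(issue):
    """Flatten one issue-like object into a dict matching the jira.csv columns."""
    assignee = getattr(issue.fields, "assignee", None)
    return {
        "Key": issue.key,
        "Summary": issue.fields.summary,
        # Fall back to a placeholder instead of failing on a missing assignee.
        "Assignee": assignee.displayName if assignee else "Unassigned",
    }


if __name__ == "__main__":
    # Stand-ins for the objects a Jira client would return (illustrative data).
    assigned = SimpleNamespace(
        key="JT-1",
        fields=SimpleNamespace(
            summary="Multiple-page UI",
            assignee=SimpleNamespace(displayName="Lakshmi Narayanan"),
        ),
    )
    unassigned = SimpleNamespace(
        key="JT-4",
        fields=SimpleNamespace(summary="Example without an owner", assignee=None),
    )
    for issue in (assigned, unassigned):
        print(issue_to_row(issue))
```

Keeping the flattening step in its own function makes the CSV schema easy to test separately from the network call.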
requirements.txt ADDED
@@ -0,0 +1,34 @@
1
+ # Natural Language Processing
2
+ langchain==0.0.191
3
+ chromadb==0.3.22
4
+ llama-cpp-python==0.1.66
5
+ pdfminer.six==20221105
6
+ InstructorEmbedding
7
+ sentence-transformers
8
+ faiss-cpu
9
+ huggingface_hub
10
+ transformers
11
+ protobuf==3.20.0; sys_platform != 'darwin'
12
+ protobuf==3.20.0; sys_platform == 'darwin' and platform_machine != 'arm64'
13
+ protobuf==3.20.3; sys_platform == 'darwin' and platform_machine == 'arm64'
14
+ auto-gptq==0.2.2
15
+ docx2txt
16
+
17
+ # Utilities
18
+ urllib3==1.26.6
19
+ accelerate
20
+ bitsandbytes ; sys_platform != 'win32'
21
+ bitsandbytes-windows ; sys_platform == 'win32'
22
+ click
23
+ flask
24
+ requests
25
+
26
+ # Excel File Manipulation
27
+ openpyxl
28
+
29
+ #custom
30
+ youtube-transcript-api
31
+ streamlit
32
+ jira
33
+
34
+
run_localGPT.py ADDED
@@ -0,0 +1,245 @@
1
+ import logging
2
+
3
+ import click
4
+ import torch
5
+ from auto_gptq import AutoGPTQForCausalLM
6
+ from huggingface_hub import hf_hub_download
7
+ from langchain.chains import RetrievalQA
8
+ from langchain.embeddings import HuggingFaceInstructEmbeddings
9
+ from langchain.llms import HuggingFacePipeline, LlamaCpp
10
+
11
+ # from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
12
+ from langchain.vectorstores import Chroma
13
+ from transformers import (
14
+ AutoModelForCausalLM,
15
+ AutoTokenizer,
16
+ GenerationConfig,
17
+ LlamaForCausalLM,
18
+ LlamaTokenizer,
19
+ pipeline,
20
+ )
21
+
22
+ from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY
23
+
24
+
25
+ def load_model(device_type, model_id, model_basename=None):
26
+ """
27
+ Select a model for text generation using the HuggingFace library.
28
+ If you are running this for the first time, it will download a model for you.
29
+ Subsequent runs will use the model from disk.
30
+
31
+ Args:
32
+ device_type (str): Type of device to use, e.g., "cuda" for GPU or "cpu" for CPU.
33
+ model_id (str): Identifier of the model to load from HuggingFace's model hub.
34
+ model_basename (str, optional): Basename of the model if using quantized models.
35
+ Defaults to None.
36
+
37
+ Returns:
38
+ HuggingFacePipeline: A pipeline object for text generation using the loaded model.
39
+
40
+ Raises:
41
+ ValueError: If an unsupported model or device type is provided.
42
+ """
43
+ logging.info(f"Loading Model: {model_id}, on: {device_type}")
44
+ logging.info("This action can take a few minutes!")
45
+
46
+ if model_basename is not None:
47
+ if ".ggml" in model_basename:
48
+ logging.info("Using Llamacpp for GGML quantized models")
49
+ model_path = hf_hub_download(repo_id=model_id, filename=model_basename)
50
+ max_ctx_size = 2048
51
+ kwargs = {
52
+ "model_path": model_path,
53
+ "n_ctx": max_ctx_size,
54
+ "max_tokens": max_ctx_size,
55
+ }
56
+ if device_type.lower() == "mps":
57
+ kwargs["n_gpu_layers"] = 1000
58
+ if device_type.lower() == "cuda":
59
+ kwargs["n_gpu_layers"] = 1000
60
+ kwargs["n_batch"] = max_ctx_size
61
+ return LlamaCpp(**kwargs)
62
+
63
+ else:
64
+ # The code supports all Hugging Face models that end with GPTQ and have some variation
65
+ # of .no-act.order or .safetensors in their HF repo.
66
+ logging.info("Using AutoGPTQForCausalLM for quantized models")
67
+
68
+ if ".safetensors" in model_basename:
69
+ # Remove the ".safetensors" ending if present
70
+ model_basename = model_basename.replace(".safetensors", "")
71
+
72
+ tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
73
+ logging.info("Tokenizer loaded")
74
+
75
+ model = AutoGPTQForCausalLM.from_quantized(
76
+ model_id,
77
+ model_basename=model_basename,
78
+ use_safetensors=True,
79
+ trust_remote_code=True,
80
+ device="cuda:0",
81
+ use_triton=False,
82
+ quantize_config=None,
83
+ )
84
+ elif (
85
+ device_type.lower() == "cuda"
86
+ ): # The code supports all Hugging Face models that end with -HF or that have a .bin
87
+ # file in their HF repo.
88
+ logging.info("Using AutoModelForCausalLM for full models")
89
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
90
+ logging.info("Tokenizer loaded")
91
+
92
+ model = AutoModelForCausalLM.from_pretrained(
93
+ model_id,
94
+ device_map="auto",
95
+ torch_dtype=torch.float16,
96
+ low_cpu_mem_usage=True,
97
+ trust_remote_code=True,
98
+ # max_memory={0: "15GB"} # Uncomment this line if you encounter CUDA out of memory errors
99
+ )
100
+ model.tie_weights()
101
+ else:
102
+ logging.info("Using LlamaTokenizer")
103
+ tokenizer = LlamaTokenizer.from_pretrained(model_id)
104
+ model = LlamaForCausalLM.from_pretrained(model_id)
105
+
106
+ # Load configuration from the model to avoid warnings
107
+ generation_config = GenerationConfig.from_pretrained(model_id)
108
+ # see here for details:
109
+ # https://huggingface.co/docs/transformers/
110
+ # main_classes/text_generation#transformers.GenerationConfig.from_pretrained.returns
111
+
112
+ # Create a pipeline for text generation
113
+ pipe = pipeline(
114
+ "text-generation",
115
+ model=model,
116
+ tokenizer=tokenizer,
117
+ max_length=2048,
118
+ temperature=0,
119
+ top_p=0.95,
120
+ repetition_penalty=1.15,
121
+ generation_config=generation_config,
122
+ )
123
+
124
+ local_llm = HuggingFacePipeline(pipeline=pipe)
125
+ logging.info("Local LLM Loaded")
126
+
127
+ return local_llm
128
+
129
+
130
+ # # Choose the device type to run on, and whether to show source documents.
131
+ # @click.command()
132
+ # @click.option(
133
+ # "--device_type",
134
+ # default="cuda" if torch.cuda.is_available() else "cpu",
135
+ # type=click.Choice(
136
+ # [
137
+ # "cpu",
138
+ # "cuda",
139
+ # "ipu",
140
+ # "xpu",
141
+ # "mkldnn",
142
+ # "opengl",
143
+ # "opencl",
144
+ # "ideep",
145
+ # "hip",
146
+ # "ve",
147
+ # "fpga",
148
+ # "ort",
149
+ # "xla",
150
+ # "lazy",
151
+ # "vulkan",
152
+ # "mps",
153
+ # "meta",
154
+ # "hpu",
155
+ # "mtia",
156
+ # ],
157
+ # ),
158
+ # help="Device to run on. (Default is cuda)",
159
+ # )
160
+ # @click.option(
161
+ # "--show_sources",
162
+ # "-s",
163
+ # is_flag=True,
164
+ # help="Show sources along with answers (Default is False)",
165
+ # )
166
+ def main(device_type, strQuery):
167
+ """
168
+ This function implements the information retrieval task.
169
+
170
+
171
+ 1. Loads an embedding model, can be HuggingFaceInstructEmbeddings or HuggingFaceEmbeddings
172
+ 2. Loads the existing vectorstore that was created by ingest.py
173
+ 3. Loads the local LLM using load_model function - You can now set different LLMs.
174
+ 4. Sets up the question-answering retrieval chain.
175
+ 5. Question answers.
176
+ """
177
+
178
+ logging.info(f"Running on: {device_type}")
179
+ #logging.info(f"Display Source Documents set to: {show_sources}")
180
+
181
+ embeddings = HuggingFaceInstructEmbeddings(model_name=EMBEDDING_MODEL_NAME, model_kwargs={"device": device_type})
182
+
183
+ # uncomment the following line if you used HuggingFaceEmbeddings in the ingest.py
184
+ # embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
185
+
186
+ # load the vectorstore
187
+ db = Chroma(
188
+ persist_directory=PERSIST_DIRECTORY,
189
+ embedding_function=embeddings,
190
+ client_settings=CHROMA_SETTINGS,
191
+ )
192
+ retriever = db.as_retriever()
193
+
194
+ # load the LLM for generating Natural Language responses
195
+
196
+ # for HF models
197
+ # model_id = "TheBloke/vicuna-7B-1.1-HF"
198
+ # model_basename = None
199
+ # model_id = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"
200
+ # model_id = "TheBloke/guanaco-7B-HF"
201
+ # model_id = 'NousResearch/Nous-Hermes-13b' # Requires ~ 23GB VRAM. Using STransformers
202
+ # alongside will 100% create OOM on 24GB cards.
203
+ # llm = load_model(device_type, model_id=model_id)
204
+
205
+ # for GPTQ (quantized) models
206
+ # model_id = "TheBloke/Nous-Hermes-13B-GPTQ"
207
+ # model_basename = "nous-hermes-13b-GPTQ-4bit-128g.no-act.order"
208
+ # model_id = "TheBloke/WizardLM-30B-Uncensored-GPTQ"
209
+ # model_basename = "WizardLM-30B-Uncensored-GPTQ-4bit.act-order.safetensors" # Requires
210
+ # ~21GB VRAM. Using STransformers alongside can potentially create OOM on 24GB cards.
211
+ # model_id = "TheBloke/wizardLM-7B-GPTQ"
212
+ # model_basename = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"
213
+ # model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"
214
+ # model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"
215
+
216
+ # for GGML (quantized cpu+gpu+mps) models - check if they support llama.cpp
217
+ # model_id = "TheBloke/wizard-vicuna-13B-GGML"
218
+ # model_basename = "wizard-vicuna-13B.ggmlv3.q4_0.bin"
219
+ # model_basename = "wizard-vicuna-13B.ggmlv3.q6_K.bin"
220
+ # model_basename = "wizard-vicuna-13B.ggmlv3.q2_K.bin"
221
+ # model_id = "TheBloke/orca_mini_3B-GGML"
222
+ # model_basename = "orca-mini-3b.ggmlv3.q4_0.bin"
223
+
224
+ model_id = "TheBloke/Llama-2-7B-Chat-GGML"
225
+ model_basename = "llama-2-7b-chat.ggmlv3.q4_0.bin"
226
+
227
+ llm = load_model(device_type, model_id=model_id, model_basename=model_basename)
228
+
229
+ qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)
230
+ # Interactive questions and answers
231
+
232
+ query = strQuery
233
+ # Get the answer from the chain
234
+ res = qa(query)
235
+ answer, docs = res["result"], res["source_documents"]
236
+
237
+ return(answer)
238
+
239
+
240
+
241
+ if __name__ == "__main__":
242
+ logging.basicConfig(
243
+ format="%(asctime)s - %(levelname)s - %(filename)s:%(lineno)s - %(message)s", level=logging.INFO
244
+ )
245
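+ print(main(device_type="cuda" if torch.cuda.is_available() else "cpu", strQuery=input("Enter a query: ")))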
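The branching in `load_model` is easier to follow when separated from the heavy model downloads. The following is a minimal sketch of that dispatch, assuming the same rules as the code above; `choose_loader`, `llama_cpp_kwargs` and the returned labels are hypothetical names used only for illustration.

```python
def choose_loader(device_type, model_basename=None):
    """Return a label for the backend that load_model would pick."""
    if model_basename is not None:
        if ".ggml" in model_basename:
            return "llama.cpp"       # GGML quantized file, loaded through LlamaCpp
        return "auto-gptq"           # any other basename is treated as a GPTQ model
    if device_type.lower() == "cuda":
        return "transformers-fp16"   # full model via AutoModelForCausalLM
    return "llama-hf"                # fallback: LlamaTokenizer + LlamaForCausalLM


def llama_cpp_kwargs(model_path, device_type, max_ctx_size=2048):
    """Build the keyword arguments passed to LlamaCpp for a GGML model."""
    kwargs = {
        "model_path": model_path,
        "n_ctx": max_ctx_size,
        "max_tokens": max_ctx_size,
    }
    if device_type.lower() in ("mps", "cuda"):
        kwargs["n_gpu_layers"] = 1000  # offload as many layers as the GPU can take
    if device_type.lower() == "cuda":
        kwargs["n_batch"] = max_ctx_size
    return kwargs


if __name__ == "__main__":
    print(choose_loader("cpu", "llama-2-7b-chat.ggmlv3.q4_0.bin"))
    print(llama_cpp_kwargs("/path/to/model.bin", "mps"))
```

Note that the GPTQ branch in the original always loads onto `cuda:0`, so a basename without `.ggml` effectively requires a CUDA device regardless of `device_type`.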
similarity.py ADDED
@@ -0,0 +1,22 @@
+ import json
+
+ import numpy as np
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
+
+
+ def similarity(strQuery):
+     # Load the transcript chunks; each entry has "text", "start" and "end".
+     with open('chunks.json', 'r') as file:
+         inputs = json.load(file)
+     lstCorpus = [dct['text'] for dct in inputs]
+
+     # Embed the caller's query and every chunk.
+     qryEmbedding = model.encode(strQuery, convert_to_tensor=True)
+     corpusEmbedding = model.encode(lstCorpus, convert_to_tensor=True)
+
+     # Pick the chunk with the highest cosine similarity to the query.
+     sim_mat = util.pytorch_cos_sim(qryEmbedding, corpusEmbedding)
+     npSim = np.array(sim_mat[0].tolist())
+     indexMax = npSim.argmax()
+
+     return (inputs[indexMax]['start'], inputs[indexMax]['end'])
+
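The selection step in `similarity` is an argmax over cosine similarities. A dependency-free sketch of that step follows; it uses hand-written toy vectors in place of sentence embeddings, and `cosine_similarity` and `best_chunk` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_chunk(query_vec, chunk_vecs, chunks):
    """Return (start, end, score) of the chunk whose vector is closest to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in chunk_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return chunks[best]["start"], chunks[best]["end"], scores[best]


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real ones would come from an embedding model.
    chunks = [
        {"text": "intro", "start": 0.0, "end": 16.0},
        {"text": "why Python is popular", "start": 16.0, "end": 32.0},
    ]
    print(best_chunk([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1]], chunks))
```

Returning the score alongside the timestamps lets a caller reject weak matches with a threshold, which the original function cannot do because it returns only the time range.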
transcript.json ADDED
@@ -0,0 +1,312 @@
1
+ [
2
+ {
3
+ "text": "In this video, I'm going to answer the top 3 questions",
4
+ "start": 0.0,
5
+ "duration": 4.0
6
+ },
7
+ {
8
+ "text": "my students ask me about Python. What is Python? What ",
9
+ "start": 4.0,
10
+ "duration": 4.0
11
+ },
12
+ {
13
+ "text": "can you do with it? And why is it so popular? In other words, what",
14
+ "start": 8.0,
15
+ "duration": 4.0
16
+ },
17
+ {
18
+ "text": "does it do that other programming languages don't? Python is the ",
19
+ "start": 12.0,
20
+ "duration": 4.0
21
+ },
22
+ {
23
+ "text": "world's fastest growing and most popular programming language, not just ",
24
+ "start": 16.0,
25
+ "duration": 4.0
26
+ },
27
+ {
28
+ "text": "amongst software engineers, but also amongst mathematicians, ",
29
+ "start": 20.0,
30
+ "duration": 4.0
31
+ },
32
+ {
33
+ "text": "data analysts, scientists, accountants, networking engineers,",
34
+ "start": 24.0,
35
+ "duration": 4.0
36
+ },
37
+ {
38
+ "text": "and even kids! Because it's a very beginner friendly programming ",
39
+ "start": 28.0,
40
+ "duration": 4.0
41
+ },
42
+ {
43
+ "text": "language. So people from different disciplines use Python",
44
+ "start": 32.0,
45
+ "duration": 4.0
46
+ },
47
+ {
48
+ "text": "for a variety of different tasks, such as data analysis and visualization, ",
49
+ "start": 36.0,
50
+ "duration": 4.0
51
+ },
52
+ {
53
+ "text": "artificial intelligence and machine learning, automation ",
54
+ "start": 40.0,
55
+ "duration": 4.0
56
+ },
57
+ {
58
+ "text": "in fact this is one of the big uses of Python amongst people who are not software",
59
+ "start": 44.0,
60
+ "duration": 4.0
61
+ },
62
+ {
63
+ "text": "developers. If you constantly have to do boring, repetitive ",
64
+ "start": 48.0,
65
+ "duration": 4.0
66
+ },
67
+ {
68
+ "text": "tasks, such as copying files and folders around, renaming them, ",
69
+ "start": 52.0,
70
+ "duration": 4.0
71
+ },
72
+ {
73
+ "text": "uploading them to a server, you can easily write a Python script to",
74
+ "start": 56.0,
75
+ "duration": 4.0
76
+ },
77
+ {
78
+ "text": "automate all that and save your time. And that's just one example, if you",
79
+ "start": 60.0,
80
+ "duration": 4.0
81
+ },
82
+ {
83
+ "text": "continuously have to work with excel spreadsheets, PDF's, CS",
84
+ "start": 64.0,
85
+ "duration": 4.0
86
+ },
87
+ {
88
+ "text": "View files, download websites and parse them, you can automate all",
89
+ "start": 68.0,
90
+ "duration": 4.0
91
+ },
92
+ {
93
+ "text": "that stuff with Python. So you don't have to be a software developer to use Python.",
94
+ "start": 72.0,
95
+ "duration": 4.0
96
+ },
97
+ {
98
+ "text": "You could be an accountant, a mathematician, or a scientist, and use Python ",
99
+ "start": 76.0,
100
+ "duration": 4.0
101
+ },
102
+ {
103
+ "text": "to make your life easier. You can also use Python to build ",
104
+ "start": 80.0,
105
+ "duration": 4.0
106
+ },
107
+ {
108
+ "text": "web, mobile and desktop applications as well as software ",
109
+ "start": 84.0,
110
+ "duration": 4.0
111
+ },
112
+ {
113
+ "text": "testing or even hacking. So Python is a multi purpose language. ",
114
+ "start": 88.0,
115
+ "duration": 4.0
116
+ },
117
+ {
118
+ "text": "Now if you have some programming experience you may say, \"But Mosh",
119
+ "start": 92.0,
120
+ "duration": 4.0
121
+ },
122
+ {
123
+ "text": "we can do all this stuff with other programming languages, so what's the big deal ",
124
+ "start": 96.0,
125
+ "duration": 4.0
126
+ },
127
+ {
128
+ "text": "about Python?\" Here are a few reasons. With Python you can ",
129
+ "start": 100.0,
130
+ "duration": 4.0
131
+ },
132
+ {
133
+ "text": "solve complex problems in less time with fewer lines of code. ",
134
+ "start": 104.0,
135
+ "duration": 4.0
136
+ },
137
+ {
138
+ "text": "Here's an example. Let's say we want to extract the first three ",
139
+ "start": 108.0,
140
+ "duration": 4.0
141
+ },
142
+ {
143
+ "text": "letters of the text Hello World. This is the code we have to write ",
144
+ "start": 112.0,
145
+ "duration": 4.0
146
+ },
147
+ {
148
+ "text": "in C# this is how we do it in JavaScript and here's how we ",
149
+ "start": 116.0,
150
+ "duration": 4.0
151
+ },
152
+ {
153
+ "text": "do it in Python. See how short and clean the language is?",
154
+ "start": 120.0,
155
+ "duration": 4.0
156
+ },
157
+ {
158
+ "text": "And that's just the beginning. Python makes a lot of trivial things",
159
+ "start": 124.0,
160
+ "duration": 4.0
161
+ },
162
+ {
163
+ "text": "really easy with a simple yet powerful syntax. Here are a few",
164
+ "start": 128.0,
165
+ "duration": 4.0
166
+ },
167
+ {
168
+ "text": "other reasons Python is so popular. It's a high level language",
169
+ "start": 132.0,
170
+ "duration": 4.0
171
+ },
172
+ {
173
+ "text": "so you don't have to worry about complex tasks such as memory management, ",
174
+ "start": 136.0,
175
+ "duration": 4.0
176
+ },
177
+ {
178
+ "text": "like you do in C++. It's cross platform which means ",
179
+ "start": 140.0,
180
+ "duration": 4.0
181
+ },
182
+ {
183
+ "text": "you can build and run Python applications on Windows, Mac, ",
184
+ "start": 144.0,
185
+ "duration": 4.0
186
+ },
187
+ {
188
+ "text": "and Linux. It has a huge community so whenever you get ",
189
+ "start": 148.0,
190
+ "duration": 4.0
191
+ },
192
+ {
193
+ "text": "stuck, there is someone out there to help. It has a large ecosystem ",
194
+ "start": 152.0,
195
+ "duration": 4.0
196
+ },
197
+ {
198
+ "text": "of libraries, frameworks and tools which means whatever you wanna do",
199
+ "start": 156.0,
200
+ "duration": 4.0
201
+ },
202
+ {
203
+ "text": "it is likely that someone else has done it before because Python has been around ",
204
+ "start": 160.0,
205
+ "duration": 4.0
206
+ },
207
+ {
208
+ "text": "for over 20 years. So in a nutshell, Python",
209
+ "start": 164.0,
210
+ "duration": 4.0
211
+ },
212
+ {
213
+ "text": "is a multi-purpose language with a simple, clean, and beginner-friendly ",
214
+ "start": 168.0,
215
+ "duration": 4.0
216
+ },
217
+ {
218
+ "text": "syntax. All of that means Python is awesome.",
219
+ "start": 172.0,
220
+ "duration": 4.0
221
+ },
222
+ {
223
+ "text": "Technically everything you do with Python you can do with other programming languages, ",
224
+ "start": 176.0,
225
+ "duration": 4.0
226
+ },
227
+ {
228
+ "text": "but Python's simplicity and elegance has made it grow way ",
229
+ "start": 180.0,
230
+ "duration": 4.0
231
+ },
232
+ {
233
+ "text": "more than other programming languages. That's why it's the number one",
234
+ "start": 184.0,
235
+ "duration": 4.0
236
+ },
237
+ {
238
+ "text": "language employers are looking for. So whether you're a programmer or ",
239
+ "start": 188.0,
240
+ "duration": 4.0
241
+ },
242
+ {
243
+ "text": "an absolute beginner, learning Python opens up lots of job opportunities ",
244
+ "start": 192.0,
245
+ "duration": 4.0
246
+ },
247
+ {
248
+ "text": "to you. In fact, the average Python developer earns a whopping",
249
+ "start": 196.0,
250
+ "duration": 4.0
251
+ },
252
+ {
253
+ "text": "116,000 dollars a year. If you",
254
+ "start": 200.0,
255
+ "duration": 4.0
256
+ },
257
+ {
258
+ "text": "found this video helpful, please support my hard work by liking and sharing it with others. ",
259
+ "start": 204.0,
260
+ "duration": 4.0
261
+ },
262
+ {
263
+ "text": "Also, be sure to subscribe to my channel, because I have a couple of",
264
+ "start": 208.0,
265
+ "duration": 4.0
266
+ },
267
+ {
268
+ "text": "awesome Python tutorials for you, you're going to see them on the screen now. ",
269
+ "start": 212.0,
270
+ "duration": 4.0
271
+ },
272
+ {
273
+ "text": "Here's my Python tutorial for beginners, it's a great starting point if you ",
274
+ "start": 216.0,
275
+ "duration": 4.0
276
+ },
277
+ {
278
+ "text": "have limited or no programming experience. On the other hand, if you ",
279
+ "start": 220.0,
280
+ "duration": 4.0
281
+ },
282
+ {
283
+ "text": "do have some programming experience and want to quickly get up to speed with Python, ",
284
+ "start": 224.0,
285
+ "duration": 4.0
286
+ },
287
+ {
288
+ "text": "I have another tutorial just for you. I'm not going to waste your time ",
289
+ "start": 228.0,
290
+ "duration": 4.0
291
+ },
292
+ {
293
+ "text": "telling you what a variable or a function is. I will talk to you like a programmer.",
294
+ "start": 232.0,
295
+ "duration": 4.0
296
+ },
297
+ {
298
+ "text": "There's never been a better time to master Python programming,",
299
+ "start": 236.0,
300
+ "duration": 4.0
301
+ },
302
+ {
303
+ "text": "so click on the tutorial that is right for you and get started. Thank you for",
304
+ "start": 240.0,
305
+ "duration": 4.0
306
+ },
307
+ {
308
+ "text": "watching!",
309
+ "start": 244.0,
310
+ "duration": 2.633
311
+ }
312
+ ]
transcript_end.json ADDED
@@ -0,0 +1,312 @@
1
+ [
2
+ {
3
+ "text": "In this video, I'm going to answer the top 3 questions",
4
+ "start": 0.0,
5
+ "end": 4.0
6
+ },
7
+ {
8
+ "text": "my students ask me about Python. What is Python? What ",
9
+ "start": 4.0,
10
+ "end": 8.0
11
+ },
12
+ {
13
+ "text": "can you do with it? And why is it so popular? In other words, what",
14
+ "start": 8.0,
15
+ "end": 12.0
16
+ },
17
+ {
18
+ "text": "does it do that other programming languages don't? Python is the ",
19
+ "start": 12.0,
20
+ "end": 16.0
21
+ },
22
+ {
23
+ "text": "world's fastest growing and most popular programming language, not just ",
24
+ "start": 16.0,
25
+ "end": 20.0
26
+ },
27
+ {
28
+ "text": "amongst software engineers, but also amongst mathematicians, ",
29
+ "start": 20.0,
30
+ "end": 24.0
31
+ },
32
+ {
33
+ "text": "data analysts, scientists, accountants, networking engineers,",
34
+ "start": 24.0,
35
+ "end": 28.0
36
+ },
37
+ {
38
+ "text": "and even kids! Because it's a very beginner friendly programming ",
39
+ "start": 28.0,
40
+ "end": 32.0
41
+ },
42
+ {
43
+ "text": "language. So people from different disciplines use Python",
44
+ "start": 32.0,
45
+ "end": 36.0
46
+ },
47
+ {
48
+ "text": "for a variety of different tasks, such as data analysis and visualization, ",
49
+ "start": 36.0,
50
+ "end": 40.0
51
+ },
52
+ {
53
+ "text": "artificial intelligence and machine learning, automation ",
54
+ "start": 40.0,
55
+ "end": 44.0
56
+ },
57
+ {
58
+ "text": "in fact this is one of the big uses of Python amongst people who are not software",
59
+ "start": 44.0,
60
+ "end": 48.0
61
+ },
62
+ {
63
+ "text": "developers. If you constantly have to do boring, repetitive ",
64
+ "start": 48.0,
65
+ "end": 52.0
66
+ },
67
+ {
68
+ "text": "tasks, such as copying files and folders around, renaming them, ",
69
+ "start": 52.0,
70
+ "end": 56.0
71
+ },
72
+ {
73
+ "text": "uploading them to a server, you can easily write a Python script to",
74
+ "start": 56.0,
75
+ "end": 60.0
76
+ },
77
+ {
78
+ "text": "automate all that and save your time. And that's just one example, if you",
79
+ "start": 60.0,
80
+ "end": 64.0
81
+ },
82
+ {
83
+ "text": "continuously have to work with excel spreadsheets, PDF's, CS",
84
+ "start": 64.0,
85
+ "end": 68.0
86
+ },
87
+ {
88
+ "text": "View files, download websites and parse them, you can automate all",
89
+ "start": 68.0,
90
+ "end": 72.0
91
+ },
92
+ {
93
+ "text": "that stuff with Python. So you don't have to be a software developer to use Python.",
94
+ "start": 72.0,
95
+ "end": 76.0
96
+ },
97
+ {
98
+ "text": "You could be an accountant, a mathematician, or a scientist, and use Python ",
99
+ "start": 76.0,
100
+ "end": 80.0
101
+ },
102
+ {
103
+ "text": "to make your life easier. You can also use Python to build ",
104
+ "start": 80.0,
105
+ "end": 84.0
106
+ },
107
+ {
108
+ "text": "web, mobile and desktop applications as well as software ",
109
+ "start": 84.0,
110
+ "end": 88.0
111
+ },
112
+ {
113
+ "text": "testing or even hacking. So Python is a multi purpose language. ",
114
+ "start": 88.0,
115
+ "end": 92.0
116
+ },
117
+ {
118
+ "text": "Now if you have some programming experience you may say, \"But Mosh",
119
+ "start": 92.0,
120
+ "end": 96.0
121
+ },
122
+ {
123
+ "text": "we can do all this stuff with other programming languages, so what's the big deal ",
124
+ "start": 96.0,
125
+ "end": 100.0
126
+ },
127
+ {
128
+ "text": "about Python?\" Here are a few reasons. With Python you can ",
129
+ "start": 100.0,
130
+ "end": 104.0
131
+ },
132
+ {
133
+ "text": "solve complex problems in less time with fewer lines of code. ",
134
+ "start": 104.0,
135
+ "end": 108.0
136
+ },
137
+ {
138
+ "text": "Here's an example. Let's say we want to extract the first three ",
139
+ "start": 108.0,
140
+ "end": 112.0
141
+ },
142
+ {
143
+ "text": "letters of the text Hello World. This is the code we have to write ",
144
+ "start": 112.0,
145
+ "end": 116.0
146
+ },
147
+ {
148
+ "text": "in C# this is how we do it in JavaScript and here's how we ",
149
+ "start": 116.0,
150
+ "end": 120.0
151
+ },
152
+ {
153
+ "text": "do it in Python. See how short and clean the language is?",
154
+ "start": 120.0,
155
+ "end": 124.0
156
+ },
157
+ {
158
+ "text": "And that's just the beginning. Python makes a lot of trivial things",
159
+ "start": 124.0,
160
+ "end": 128.0
161
+ },
162
+ {
163
+ "text": "really easy with a simple yet powerful syntax. Here are a few",
164
+ "start": 128.0,
165
+ "end": 132.0
166
+ },
167
+ {
168
+ "text": "other reasons Python is so popular. It's a high level language",
169
+ "start": 132.0,
170
+ "end": 136.0
171
+ },
172
+ {
173
+ "text": "so you don't have to worry about complex tasks such as memory management, ",
174
+ "start": 136.0,
175
+ "end": 140.0
176
+ },
177
+ {
178
+ "text": "like you do in C++. It's cross platform which means ",
179
+ "start": 140.0,
180
+ "end": 144.0
181
+ },
182
+ {
183
+ "text": "you can build and run Python applications on Windows, Mac, ",
184
+ "start": 144.0,
185
+ "end": 148.0
186
+ },
187
+ {
188
+ "text": "and Linux. It has a huge community so whenever you get ",
189
+ "start": 148.0,
190
+ "end": 152.0
191
+ },
192
+ {
193
+ "text": "stuck, there is someone out there to help. It has a large ecosystem ",
194
+ "start": 152.0,
195
+ "end": 156.0
196
+ },
197
+ {
198
+ "text": "of libraries, frameworks and tools which means whatever you wanna do",
199
+ "start": 156.0,
200
+ "end": 160.0
201
+ },
202
+ {
203
+ "text": "it is likely that someone else has done it before because Python has been around ",
204
+ "start": 160.0,
205
+ "end": 164.0
206
+ },
207
+ {
208
+ "text": "for over 20 years. So in a nutshell, Python",
209
+ "start": 164.0,
210
+ "end": 168.0
211
+ },
212
+ {
213
+ "text": "is a multi-purpose language with a simple, clean, and beginner-friendly ",
214
+ "start": 168.0,
215
+ "end": 172.0
216
+ },
217
+ {
218
+ "text": "syntax. All of that means Python is awesome.",
219
+ "start": 172.0,
220
+ "end": 176.0
221
+ },
222
+ {
223
+ "text": "Technically everything you do with Python you can do with other programming languages, ",
224
+ "start": 176.0,
225
+ "end": 180.0
226
+ },
227
+ {
228
+ "text": "but Python's simplicity and elegance has made it grow way ",
229
+ "start": 180.0,
230
+ "end": 184.0
231
+ },
232
+ {
233
+ "text": "more than other programming languages. That's why it's the number onne",
234
+ "start": 184.0,
235
+ "end": 188.0
236
+ },
237
+ {
238
+ "text": "language employers are looking for. So whether you're a programmer or ",
239
+ "start": 188.0,
240
+ "end": 192.0
241
+ },
242
+ {
243
+ "text": "an absolute beginner, learning Python opens up lots of job opportunities ",
244
+ "start": 192.0,
245
+ "end": 196.0
246
+ },
247
+ {
248
+ "text": "to you. In fact, the average Python developer earns a whopping",
249
+ "start": 196.0,
250
+ "end": 200.0
251
+ },
252
+ {
253
+ "text": "116,000 dollars a year. If you",
254
+ "start": 200.0,
255
+ "end": 204.0
256
+ },
257
+ {
258
+ "text": "found this video helpful, please support my hard work by liking and sharing it with others. ",
259
+ "start": 204.0,
260
+ "end": 208.0
261
+ },
262
+ {
263
+ "text": "Also, be sure to subscribe to my channel, because I have a couple of",
264
+ "start": 208.0,
265
+ "end": 212.0
266
+ },
267
+ {
268
+ "text": "awesome Python tutorials for you, you're going to see them on the screen now. ",
269
+ "start": 212.0,
270
+ "end": 216.0
271
+ },
272
+ {
273
+ "text": "Here's my Python tutorial for beginners, it's a great starting point if you ",
274
+ "start": 216.0,
275
+ "end": 220.0
276
+ },
277
+ {
278
+ "text": "have limited or no programming experience. On the other hand, if you ",
279
+ "start": 220.0,
280
+ "end": 224.0
281
+ },
282
+ {
283
+ "text": "do have some programming experience and want to quickly get up to speed with Python, ",
284
+ "start": 224.0,
285
+ "end": 228.0
286
+ },
287
+ {
288
+ "text": "I have another tutorial just for you. I'm not going to waste your time ",
289
+ "start": 228.0,
290
+ "end": 232.0
291
+ },
292
+ {
293
+ "text": "telling you what a variable or a function is. I will talk to you like a programmer.",
294
+ "start": 232.0,
295
+ "end": 236.0
296
+ },
297
+ {
298
+ "text": "There's never been a better time to master Python programming,",
299
+ "start": 236.0,
300
+ "end": 240.0
301
+ },
302
+ {
303
+ "text": "so click on the tutorial that is right for you and get started. Thank you for",
304
+ "start": 240.0,
305
+ "end": 244.0
306
+ },
307
+ {
308
+ "text": "watching!",
309
+ "start": 244.0,
310
+ "end": 246.63
311
+ }
312
+ ]
utils.py ADDED
@@ -0,0 +1,51 @@
1
+ import json
2
+
3
+ def calculate_ends(input_file_path, output_file_path):
4
+ with open(input_file_path, 'r') as file:
5
+ input_data = json.load(file)
6
+
7
+ # Iterate through the list of dictionaries and calculate "end" for each one
8
+ for item in input_data:
9
+ item["end"] =round(item["start"] + item["duration"],2)
10
+ del item["duration"] # Remove the "duration" key from each dictionary
11
+
12
+ # Save the updated data to a new JSON file
13
+ with open(output_file_path, 'w') as output_file:
14
+ json.dump(input_data, output_file)
15
+
16
+ import json
17
+
18
+ def create_chunks(input_file_path, output_file_path):
19
+ with open(input_file_path, 'r') as file:
20
+ output_data = json.load(file)
21
+
22
+ combined_json_list = []
23
+
24
+ # Calculate the number of groups to create
25
+ num_groups = (len(output_data) + 3) // 4
26
+
27
+ for group_num in range(num_groups):
28
+ # Calculate the starting index and ending index for the current group
29
+ start_index = group_num * 4
30
+ end_index = min(start_index + 4, len(output_data))
31
+
32
+ # Extract the "text" values from the current group of dictionaries
33
+ combined_text = " ".join([item["text"] for item in output_data[start_index:end_index]])
34
+
35
+ # Calculate the "start" and "end" for the current group
36
+ group_start = output_data[start_index]["start"]
37
+ group_end = output_data[end_index - 1]["end"]
38
+
39
+ # Create the combined JSON for the current group
40
+ combined_json = {
41
+ "text": combined_text,
42
+ "start": group_start,
43
+ "end": group_end,
44
+ }
45
+
46
+ combined_json_list.append(combined_json)
47
+
48
+ # Save the combined JSON list to a new file
49
+ with open(output_file_path, 'w') as output_file:
50
+ json.dump(combined_json_list, output_file)
51
+
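The two helpers in utils.py appear to turn each transcript segment's `duration` into an `end` timestamp and then merge every four consecutive segments into one chunk. A minimal in-memory sketch of that flow follows. The function names `with_ends` and `chunk_segments`, the `group_size` parameter, and the sample segments are assumptions made for illustration and are not part of the commit.

```python
def with_ends(segments):
    """Return copies of the segments with "duration" replaced by "end"."""
    result = []
    for seg in segments:
        # Copy every key except "duration" so the input is left untouched.
        item = {k: v for k, v in seg.items() if k != "duration"}
        item["end"] = round(seg["start"] + seg["duration"], 2)
        result.append(item)
    return result


def chunk_segments(segments, group_size=4):
    """Merge consecutive segments into chunks of up to group_size items."""
    chunks = []
    for i in range(0, len(segments), group_size):
        group = segments[i:i + group_size]
        chunks.append({
            "text": " ".join(s["text"] for s in group),
            "start": group[0]["start"],   # start of the first segment
            "end": group[-1]["end"],      # end of the last segment
        })
    return chunks


# Hypothetical sample data: five 4-second segments.
sample = [
    {"text": "a", "start": 0.0, "duration": 4.0},
    {"text": "b", "start": 4.0, "duration": 4.0},
    {"text": "c", "start": 8.0, "duration": 4.0},
    {"text": "d", "start": 12.0, "duration": 4.0},
    {"text": "e", "start": 16.0, "duration": 4.0},
]
chunks = chunk_segments(with_ends(sample))
```

Stepping through the list with `range(0, len(segments), group_size)` gives the same grouping as the ceiling-division loop in `create_chunks`, including a shorter final chunk when the segment count is not a multiple of four.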
youtube.py ADDED
@@ -0,0 +1,68 @@
+ from youtube_transcript_api import YouTubeTranscriptApi
+ from youtube_transcript_api.formatters import JSONFormatter
+ import json
+ import ingest
+ import run_localGPT
+ import utils
+
+ def audio_to_transcript(video_id):
+     sub = YouTubeTranscriptApi.get_transcript(video_id)
+     formatted_subs = JSONFormatter().format_transcript(transcript=sub)
+     with open("transcript.json", "w") as outfile:
+         json.dump(sub, outfile)
+     lstTexts = []
+     for dct in sub:
+         lstTexts.append(dct['text'])
+     strResult = ' '.join(lstTexts)
+     with open('SOURCE_DOCUMENTS/transcript.txt', 'w') as outfile:
+         outfile.write(strResult)
+     transcript = ' '.join(lstTexts)
+
+     utils.calculate_ends('transcript.json','transcript_end.json')
+     utils.create_chunks('transcript_end.json','chunks.json')
+
+     return transcript
+
+ def start_training():
+     training_status = ingest.main()
+     return training_status
+
+ def replace_substring_and_following(input_string, substring):
+     index = input_string.find(substring)
+     if index != -1:
+         return input_string[:index]
+     else:
+         return input_string
+
+ def ask_question(strQuestion):
+     answer = run_localGPT.main(device_type='cpu', strQuery=strQuestion)
+     answer_cleaned = replace_substring_and_following(answer, "Unhelpful Answer")
+     return answer_cleaned
+
+ def summarize():
+
+     from langchain.text_splitter import CharacterTextSplitter
+     from langchain.chains.mapreduce import MapReduceChain
+     from langchain.prompts import PromptTemplate
+
+     model_id = "TheBloke/Llama-2-7B-Chat-GGML"
+     model_basename = "llama-2-7b-chat.ggmlv3.q4_0.bin"
+
+     llm = run_localGPT.load_model(device_type='cpu', model_id=model_id, model_basename=model_basename)
+
+     text_splitter = CharacterTextSplitter()
+
+     with open("SOURCE_DOCUMENTS/transcript.txt") as f:
+         file_content = f.read()
+     texts = text_splitter.split_text(file_content)
+
+     from langchain.docstore.document import Document
+
+     docs = [Document(page_content=t) for t in texts]
+
+     from langchain.chains.summarize import load_summarize_chain
+
+     chain = load_summarize_chain(llm, chain_type="map_reduce")
+     summary = chain.run(docs)
+     return summary
+
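`replace_substring_and_following` keeps only the text that precedes the first occurrence of a marker, which `ask_question` uses to cut the model output at "Unhelpful Answer". The sketch below shows the same truncation with `str.partition`. The name `truncate_at` and the sample answer string are illustrative assumptions, not part of the commit.

```python
def truncate_at(text, marker):
    """Return the part of text before the first occurrence of marker.

    If marker is absent, the whole text is returned unchanged.
    Assumes marker is non-empty: str.partition raises ValueError on "".
    """
    head, _sep, _tail = text.partition(marker)
    return head


# Hypothetical model output, for illustration only.
raw_answer = "Python is a programming language. Unhelpful Answer: I am not sure."
cleaned = truncate_at(raw_answer, "Unhelpful Answer")
```

One behavioral difference from the committed helper: with an empty marker, `find` returns 0 and the original function returns an empty string, whereas `partition` raises `ValueError`.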