hkoppen committed
Commit
7ca3dfa
1 Parent(s): bebe5de

Upload 23 files

.DS_Store ADDED
Binary file (10.2 kB).
 
.github/workflows/main.yml ADDED
@@ -0,0 +1,19 @@
+ name: Sync to Hugging Face hub
+ on:
+   push:
+     branches: [puma_demo]
+   # to run this workflow manually from the Actions tab
+   workflow_dispatch:
+
+ jobs:
+   sync-to-hub:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v3
+         with:
+           fetch-depth: 0
+           lfs: true
+       - name: Push to hub
+         env:
+           HF_TOKEN: ${{ secrets.HF_TOKEN }}
+         run: git push https://hkoppen:$HF_TOKEN@huggingface.co/spaces/MachineLearningReply/q-and-a-tool-custom-logo puma_demo
.gitignore ADDED
@@ -0,0 +1,47 @@
+ # See https://help.github.com/articles/ignoring-files/ for more about ignoring files.
+
+ # dependencies
+ node_modules
+ .pnp
+ .pnp.js
+
+ # testing
+ coverage
+
+ # next.js
+ .next/
+ out/
+ build
+
+ # misc
+ .DS_Store
+ *.pem
+
+ # debug
+ npm-debug.log*
+ yarn-debug.log*
+ yarn-error.log*
+ .pnpm-debug.log*
+
+ # local env files
+ .env.local
+ .env.development.local
+ .env.test.local
+ .env.production.local
+
+ # turbo
+ .turbo
+
+ .contentlayer
+ .env
+ .vercel
+ .vscode
+
+ # JetBrains
+ .idea
+
+ # Python bytecode cache
+ __pycache__/*
+
+ # datasets directory is used for local development
+ /datasets/
.streamlit/config.toml ADDED
@@ -0,0 +1,6 @@
+ [theme]
+ primaryColor = "#E694FF"
+ backgroundColor = "#FFFFFF"
+ secondaryBackgroundColor = "#F0F0F0"
+ textColor = "#262730"
+ font = "sans serif"
.vscode/settings.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "python.languageServer": "Pylance",
+   "python.analysis.typeCheckingMode": "basic",
+   "typescript.tsserver.maxTsServerMemory": 3072,
+   "typescript.tsserver.watchOptions": {
+     "watchFile": "dynamicPriorityPolling"
+   },
+   "javascript.suggest.includeAutomaticOptionalChainCompletions": false,
+   "debug.saveBeforeStart": "none",
+   "c3.welcome.showFeatureHighlight": false
+ }
Dockerfile ADDED
@@ -0,0 +1,29 @@
+ FROM python:3.10-slim
+
+ WORKDIR /app
+
+ RUN apt-get update && apt-get install -y \
+     build-essential \
+     curl \
+     software-properties-common \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ COPY requirements.txt .
+
+ RUN pip3 install -r requirements.txt
+
+ COPY . .
+
+ # extract version
+ COPY .git ./.git
+ RUN git rev-parse --short HEAD > revision.txt
+ RUN rm -rf ./.git
+
+ EXPOSE 8501
+
+ HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
+
+ ENV PYTHONPATH "${PYTHONPATH}:."
+
+ ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
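
The `# extract version` step above bakes the short commit hash into the image as `revision.txt`. As a minimal sketch (not part of this commit, and assuming the app runs from the image's `/app` working directory), the app could surface that revision at runtime:

```python
# Sketch: read the build revision written during `docker build`.
# The sidebar caption is an illustrative assumption, not committed code.
from pathlib import Path

import streamlit as st

revision_file = Path("revision.txt")
if revision_file.exists():
    st.sidebar.caption(f"Build revision: {revision_file.read_text().strip()}")
```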
README.md CHANGED
@@ -1,12 +1,108 @@
- ---
- title: Q And A Tool Custom Logo
- emoji: 🏃
- colorFrom: green
- colorTo: gray
- sdk: streamlit
- sdk_version: 1.36.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ---
+ title: NLP Q&A Tool
+ emoji: 👑
+ colorFrom: indigo
+ colorTo: indigo
+ sdk: streamlit
+ sdk_version: 1.32.2
+ app_file: app.py
+ pinned: false
+ ---
+
+ # Document Insights - Extractive & Generative Methods using Haystack
+
+ This is a template [Streamlit](https://docs.streamlit.io/) app set up for
+ simple [Haystack search applications](https://docs.haystack.deepset.ai/docs/semantic_search). The template is ready to
+ do QA with **Retrieval Augmented Generation** or **Extractive QA**.
+
+ Below you will also find instructions on how you
+ can [push this to Hugging Face Spaces 🤗](#pushing-to-hugging-face-spaces-).
+
+ ## Installation and Running
+
+ ### Local development
+
+ To run the bare application, which does _nothing_:
+
+ 1. Install requirements: `pip install -r requirements.txt`
+ 2. Run the Streamlit app: `streamlit run app.py`
+
+ This will start up the app on `localhost:8501`, where you will find a simple search bar. Before you start editing,
+ the app will only show you instructions on what to edit.
+
+ ### Docker
+
+ To run the app in a Docker container:
+
+ 1. Build the Docker image: `docker build -t haystack-streamlit .`
+ 2. Run the Docker container: `docker run -p 8501:8501 haystack-streamlit` (make sure to bind any other ports you need)
+ 3. Open your browser and go to `http://localhost:8501`
+
+ ### Repo structure
+
+ - `./utils`: This is where we have 3 files:
+   - `config.py`: This file extracts all of the configuration settings from a `.env` file. For some config settings, it
+     uses default values. An example of this is
+     in [this demo project](https://github.com/TuanaCelik/should-i-follow/blob/main/utils/config.py).
+   - `haystack.py`: Here you will find some functions already set up for you to start creating your Haystack search
+     pipeline. It includes 2 main functions: `start_haystack()`, which we use to create a pipeline and
+     cache it, and `query()`, which is called by `app.py` once a user query is received.
+   - `ui.py`: Use this file for any UI and initial value setups.
+ - `app.py`: This is the main Streamlit application file that we will run. In its current state, it has a simple search
+   bar, a 'Run' button, and a response that you can highlight answers with.
+ - `requirements.txt`: This file includes the required libraries to run the Streamlit app.
+ - `document_qa_engine.py`: This file includes the QA pipeline with Haystack.
+
+ ### What to edit?
+
+ There are default pipelines in both `start_haystack_extractive()` and `start_haystack_rag()`:
+
+ - Change the pipelines to use the embedding, extractive, or generative models you need.
+ - If using the `rag` task, change the `default_prompt_template` to one of the available templates
+   on [PromptHub](https://prompthub.deepset.ai) or create your own `PromptTemplate`.
+
+ ### Using local LLM models
+
+ To use the `local LLM` mode, you can use [LM Studio](https://lmstudio.ai/) or [Ollama](https://ollama.com/).
+ For more info on how to run the app with a local LLM model, please refer to the documentation of the tool you are using.
+ The `local LLM` mode expects an OpenAI-compatible API available at `http://localhost:1234/v1`.
+
+ ## Pushing to Hugging Face Spaces 🤗
+
+ Below is an example GitHub Action that will let you push your Streamlit app straight to the Hugging Face Hub as a Space.
+
+ A few things to pay attention to:
+
+ 1. Create a new Space on Hugging Face with the Streamlit SDK.
+ 2. Create a Hugging Face token on your HF account.
+ 3. Create a secret on your GitHub repo called `HF_TOKEN` and put your Hugging Face token there.
+ 4. If you're using DocumentStores or APIs that require keys/tokens, make sure these are provided as secrets for
+    your HF Space too!
+ 5. This README is set up to tell HF Spaces that the app uses Streamlit and runs on `app.py`; make any
+    changes to the frontmatter of this README to display the title, emoji, etc. you desire.
+ 6. Create a file at `.github/workflows/hf_sync.yml`. Here's an example that you can adapt with your own information,
+    and an [example workflow](https://github.com/TuanaCelik/should-i-follow/blob/main/.github/workflows/hf_sync.yml)
+    working for the [Should I Follow demo](https://huggingface.co/spaces/deepset/should-i-follow).
+
+ ```yaml
+ name: Sync to Hugging Face hub
+ on:
+   push:
+     branches: [ main ]
+
+   # to run this workflow manually from the Actions tab
+   workflow_dispatch:
+
+ jobs:
+   sync-to-hub:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v2
+         with:
+           fetch-depth: 0
+           lfs: true
+       - name: Push to hub
+         env:
+           HF_TOKEN: ${{ secrets.HF_TOKEN }}
+         run: git push --force https://{YOUR_HF_USERNAME}:$HF_TOKEN@{YOUR_HF_SPACE_REPO} main
+ ```
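
For the `local LLM` mode described in the README above, a quick way to verify that an OpenAI-compatible server is listening on `http://localhost:1234/v1` is a smoke test like the following sketch; the model name is a placeholder that depends on what your LM Studio or Ollama instance exposes:

```python
# Hypothetical smoke test against the OpenAI-compatible local endpoint.
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "local-model",  # placeholder; depends on your local server
        "messages": [{"role": "user", "content": "Reply with one word: ready?"}],
        "max_tokens": 10,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```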
__pycache__/document_qa_engine.cpython-310.pyc ADDED
Binary file (5.11 kB).
 
__pycache__/utils.cpython-310.pyc ADDED
Binary file (2.61 kB).
 
app.py ADDED
@@ -0,0 +1,241 @@
+ from dotenv import load_dotenv
+ import pandas as pd
+ import streamlit as st
+ import streamlit_authenticator as stauth
+ from streamlit_modal import Modal
+
+ from utils import new_file, clear_memory, append_documentation_to_sidebar, load_authenticator_config, init_qa, \
+     append_header
+ from haystack.document_stores.in_memory import InMemoryDocumentStore
+ from haystack import Document
+
+ load_dotenv()
+
+ OPENAI_MODELS = ['gpt-3.5-turbo',
+                  "gpt-4",
+                  "gpt-4-1106-preview"]
+
+ OPEN_MODELS = [
+     'mistralai/Mistral-7B-Instruct-v0.1',
+     'HuggingFaceH4/zephyr-7b-beta'
+ ]
+
+
+ def reset_chat_memory():
+     st.button(
+         'Reset chat memory',
+         key="reset-memory-button",
+         on_click=clear_memory,
+         help="Clear the conversational memory. Currently implemented to retain the 4 most recent messages.",
+         disabled=False)
+
+
+ def manage_files(modal, document_store):
+     open_modal = st.sidebar.button("Manage Files", use_container_width=True)
+     if open_modal:
+         modal.open()
+
+     if modal.is_open():
+         with modal.container():
+             uploaded_file = st.file_uploader(
+                 "Upload a CV in PDF format",
+                 type=("pdf",),
+                 on_change=new_file,  # pass the callback itself; calling it here would run it on every rerun
+                 disabled=st.session_state['document_qa_model'] is None,
+                 label_visibility="collapsed",
+                 help="The document is used to answer your questions. The system will process the document and store it in a RAG to answer your questions.",
+             )
+             edited_df = st.data_editor(use_container_width=True, data=st.session_state['files'],
+                                        num_rows='dynamic',
+                                        column_order=['name', 'size', 'is_active'],
+                                        column_config={'name': {'editable': False}, 'size': {'editable': False},
+                                                       'is_active': {'editable': True, 'type': 'checkbox',
+                                                                     'width': 100}}
+                                        )
+             st.session_state['files'] = pd.DataFrame(columns=['name', 'content', 'size', 'is_active'])
+
+             if uploaded_file:
+                 st.session_state['file_uploaded'] = True
+                 st.session_state['files'] = pd.concat([st.session_state['files'], edited_df])
+                 with st.spinner('Processing the CV content...'):
+                     store_file_in_table(document_store, uploaded_file)
+                     ingest_document(uploaded_file)
+
+
+ def ingest_document(uploaded_file):
+     if not st.session_state['document_qa_model']:
+         st.warning('Please select a model to start asking questions')
+     else:
+         try:
+             st.session_state['document_qa_model'].ingest_pdf(uploaded_file)
+             st.success('Document processed successfully')
+         except Exception as e:
+             st.error(f"Error processing the document: {e}")
+             st.session_state['file_uploaded'] = False
+
+
+ def store_file_in_table(document_store, uploaded_file):
+     pdf_content = uploaded_file.getvalue()
+     st.session_state['pdf_content'] = pdf_content
+     st.session_state.messages = []
+     document = Document(content=pdf_content, meta={"name": uploaded_file.name})
+     df = pd.DataFrame(st.session_state['files'])
+     df['is_active'] = False
+     st.session_state['files'] = pd.concat([df, pd.DataFrame(
+         [{"name": uploaded_file.name, "content": pdf_content, "size": len(pdf_content),
+           "is_active": True}])])
+     document_store.write_documents([document])
+
+
+ def init_session_state():
+     st.session_state.setdefault('files', pd.DataFrame(columns=['name', 'content', 'size', 'is_active']))
+     st.session_state.setdefault('models', [])
+     st.session_state.setdefault('api_keys', {})
+     st.session_state.setdefault('current_selected_model', 'gpt-3.5-turbo')
+     st.session_state.setdefault('current_api_key', '')
+     st.session_state.setdefault('messages', [])
+     st.session_state.setdefault('pdf_content', None)
+     st.session_state.setdefault('memory', None)
+     st.session_state.setdefault('pdf', None)
+     st.session_state.setdefault('document_qa_model', None)
+     st.session_state.setdefault('file_uploaded', False)
+
+
+ def set_page_config():
+     st.set_page_config(
+         page_title="CV Insights AI Assistant",
+         page_icon=":shark:",
+         initial_sidebar_state="expanded",
+         layout="wide",
+         menu_items={
+             'Get Help': 'https://www.extremelycoolapp.com/help',
+             'Report a bug': "https://www.extremelycoolapp.com/bug",
+             'About': "# This is a header. This is an *extremely* cool app!"
+         }
+     )
+
+
+ def update_running_model(api_key, model):
+     st.session_state['api_keys'][model] = api_key
+     st.session_state['document_qa_model'] = init_qa(model, api_key)
+
+
+ def init_api_key_dict():
+     st.session_state['models'] = OPENAI_MODELS + list(OPEN_MODELS) + ['local LLM']
+     for model_name in OPENAI_MODELS:
+         st.session_state['api_keys'][model_name] = None
+
+
+ def display_chat_messages(chat_box, chat_input):
+     with chat_box:
+         if chat_input:
+             for message in st.session_state.messages:
+                 with st.chat_message(message["role"]):
+                     st.markdown(message["content"], unsafe_allow_html=True)
+
+             st.chat_message("user").markdown(chat_input)
+             with st.chat_message("assistant"):
+                 # process user input and generate response
+                 response = st.session_state['document_qa_model'].inference(chat_input, st.session_state.messages)
+
+                 st.markdown(response)
+                 st.session_state.messages.append({"role": "user", "content": chat_input})
+                 st.session_state.messages.append({"role": "assistant", "content": response})
+
+
+ def setup_model_selection():
+     model = st.selectbox(
+         "Model:",
+         options=st.session_state['models'],
+         index=0,  # default to the first model in the list, gpt-3.5-turbo
+         placeholder="Select model",
+         help="Select an LLM:"
+     )
+
+     if model:
+         if model != st.session_state['current_selected_model']:
+             st.session_state['current_selected_model'] = model
+             if model == 'local LLM':
+                 st.session_state['document_qa_model'] = init_qa(model)
+
+     api_key = st.sidebar.text_input("Enter LLM-authorization Key:", type="password",
+                                     disabled=st.session_state['current_selected_model'] == 'local LLM')
+     if api_key and api_key != st.session_state['current_api_key']:
+         update_running_model(api_key, model)
+         st.session_state['current_api_key'] = api_key
+
+     return model
+
+
+ def setup_task_selection(model):
+     # enable extractive and generative tasks if we're using a local LLM or an OpenAI model with an API key
+     if model == 'local LLM' or st.session_state['api_keys'].get(model):
+         task_options = ['Extractive', 'Generative']
+     else:
+         task_options = ['Extractive']
+
+     task_selection = st.sidebar.radio('Select the task:', task_options)
+
+     # TODO: Add the task selection logic here (initializing the model based on the task)
+
+
+ def setup_page_body():
+     chat_box = st.container(height=350, border=False)
+     chat_input = st.chat_input(
+         placeholder="Upload a document to start asking questions...",
+         disabled=not st.session_state['file_uploaded'],
+     )
+     if st.session_state['file_uploaded']:
+         display_chat_messages(chat_box, chat_input)
+
+
+ class StreamlitApp:
+     def __init__(self):
+         self.authenticator_config = load_authenticator_config()
+         self.document_store = InMemoryDocumentStore()
+         set_page_config()
+         self.authenticator = self.init_authenticator()
+         init_session_state()
+         init_api_key_dict()
+
+     def init_authenticator(self):
+         return stauth.Authenticate(
+             self.authenticator_config['credentials'],
+             self.authenticator_config['cookie']['name'],
+             self.authenticator_config['cookie']['key'],
+             self.authenticator_config['cookie']['expiry_days']
+         )
+
+     def setup_sidebar(self):
+         with st.sidebar:
+             st.sidebar.image("resources/puma.png", use_column_width=True)
+
+             # Sidebar for Task Selection
+             st.sidebar.header('Options:')
+             model = setup_model_selection()
+             setup_task_selection(model)
+             st.divider()
+             self.authenticator.logout()
+             reset_chat_memory()
+             modal = Modal("Manage Files", key="demo-modal")
+             manage_files(modal, self.document_store)
+             st.divider()
+             append_documentation_to_sidebar()
+
+     def run(self):
+         name, authentication_status, username = self.authenticator.login()
+         if authentication_status:
+             self.run_authenticated_app()
+         elif st.session_state["authentication_status"] is False:
+             st.error('Username/password is incorrect')
+         elif st.session_state["authentication_status"] is None:
+             st.warning('Please enter your username and password')
+
+     def run_authenticated_app(self):
+         self.setup_sidebar()
+         append_header()
+         setup_page_body()
+
+
+ app = StreamlitApp()
+ app.run()
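
For reference, the chat history that `app.py` passes to `DocumentQAEngine.inference()` (defined in `document_qa_engine.py` below) is the plain list of role/content dicts kept in `st.session_state.messages`; a minimal illustration of its shape:

```python
# Illustrative shape of st.session_state.messages; the contents are made up.
history = [
    {"role": "user", "content": "Summarize the CV in two sentences."},
    {"role": "assistant", "content": "The candidate is a data engineer with five years of experience."},
]
```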
authenticator_config.yaml ADDED
@@ -0,0 +1,15 @@
+ credentials:
+   usernames:
+     mlreply:
+       email: mlreply@reply.de
+       failed_login_attempts: 0 # Will be managed automatically
+       logged_in: False # Will be managed automatically
+       name: ML Reply
+       password: mlreply # Will be hashed automatically
+ cookie:
+   expiry_days: 1
+   key: some_signature_key # Must be a string
+   name: some_cookie_name
+ # pre-authorized:
+ #   emails:
+ #     - melsby@gmail.com
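
The plain-text `password` above is hashed automatically on first load. If you prefer to store it pre-hashed, a sketch using the `Hasher` helper (API as of streamlit-authenticator 0.3.x, the version pinned in `requirements.txt`) would be:

```python
# Sketch: pre-hash the demo password instead of relying on automatic hashing.
import streamlit_authenticator as stauth

hashed_passwords = stauth.Hasher(['mlreply']).generate()
print(hashed_passwords[0])  # paste this value into the `password:` field above
```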
document_qa_engine.py ADDED
@@ -0,0 +1,141 @@
+ from typing import List
+
+ from haystack.dataclasses import ChatMessage
+ from pypdf import PdfReader
+ from haystack.utils import Secret
+ from haystack import Pipeline, Document, component
+
+ from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
+ from haystack.components.writers import DocumentWriter
+ from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
+ from haystack.document_stores.in_memory import InMemoryDocumentStore
+ from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
+ from haystack.components.builders import DynamicChatPromptBuilder
+ from haystack.components.generators.chat import OpenAIChatGenerator, HuggingFaceTGIChatGenerator
+ from haystack.document_stores.types import DuplicatePolicy
+
+ SENTENCE_RETRIEVER_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
+
+ MAX_TOKENS = 500
+
+ template = """
+ As a professional HR recruiter, given the following information, answer the question shortly and concisely in 1 or 2 sentences.
+
+ Context:
+ {% for document in documents %}
+     {{ document.content }}
+ {% endfor %}
+
+ Question: {{question}}
+ Answer:
+ """
+
+
+ @component
+ class UploadedFileConverter:
+     """
+     A component to convert uploaded PDF files to Documents
+     """
+
+     @component.output_types(documents=List[Document])
+     def run(self, uploaded_file):
+         pdf = PdfReader(uploaded_file)
+         documents = []
+         # uploaded file name without the .pdf extension, with '_' and the page number appended
+         name = uploaded_file.name.removesuffix('.pdf').removesuffix('.PDF') + '_'
+         for page in pdf.pages:
+             documents.append(
+                 Document(
+                     content=page.extract_text(),
+                     meta={'name': name + f"_{page.page_number}"}))
+         return {"documents": documents}
+
+
+ def create_ingestion_pipeline(document_store):
+     doc_embedder = SentenceTransformersDocumentEmbedder(model=SENTENCE_RETRIEVER_MODEL)
+     doc_embedder.warm_up()
+
+     pipeline = Pipeline()
+     pipeline.add_component("converter", UploadedFileConverter())
+     pipeline.add_component("cleaner", DocumentCleaner())
+     pipeline.add_component("splitter",
+                            DocumentSplitter(split_by="passage", split_length=100, split_overlap=10))
+     pipeline.add_component("embedder", doc_embedder)
+     pipeline.add_component("writer",
+                            DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))
+
+     pipeline.connect("converter", "cleaner")
+     pipeline.connect("cleaner", "splitter")
+     pipeline.connect("splitter", "embedder")
+     pipeline.connect("embedder", "writer")
+     return pipeline
+
+
+ def create_inference_pipeline(document_store, model_name, api_key):
+     if model_name == "local LLM":
+         generator = OpenAIChatGenerator(api_key=Secret.from_token("<local LLM doesn't need an API key>"),
+                                         model=model_name,
+                                         api_base_url="http://localhost:1234/v1",
+                                         generation_kwargs={"max_tokens": MAX_TOKENS}
+                                         )
+     elif "gpt" in model_name:
+         generator = OpenAIChatGenerator(api_key=Secret.from_token(api_key), model=model_name,
+                                         generation_kwargs={"max_tokens": MAX_TOKENS, "stream": False}
+                                         )
+     else:
+         generator = HuggingFaceTGIChatGenerator(token=Secret.from_token(api_key), model=model_name,
+                                                 generation_kwargs={"max_new_tokens": MAX_TOKENS}
+                                                 )
+     pipeline = Pipeline()
+     pipeline.add_component("text_embedder",
+                            SentenceTransformersTextEmbedder(model=SENTENCE_RETRIEVER_MODEL))
+     pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
+     pipeline.add_component("prompt_builder",
+                            DynamicChatPromptBuilder(runtime_variables=["query", "documents"]))
+     pipeline.add_component("llm", generator)
+     pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
+     pipeline.connect("retriever.documents", "prompt_builder.documents")
+     pipeline.connect("prompt_builder.prompt", "llm.messages")
+
+     return pipeline
+
+
+ class DocumentQAEngine:
+     def __init__(self,
+                  model_name,
+                  api_key=None
+                  ):
+         self.api_key = api_key
+         self.model_name = model_name
+         document_store = InMemoryDocumentStore()
+         self.chunks = []
+         self.inference_pipeline = create_inference_pipeline(document_store, model_name, api_key)
+         self.pdf_ingestion_pipeline = create_ingestion_pipeline(document_store)
+
+     def ingest_pdf(self, uploaded_file):
+         self.pdf_ingestion_pipeline.run({"converter": {"uploaded_file": uploaded_file}})
+
+     def inference(self, query, input_messages: List[dict]):
+         system_message = ChatMessage.from_system(
+             "You are a professional HR recruiter that answers questions based on the content of the uploaded CV, in 1 or 2 sentences.")
+         messages = [system_message]
+         for message in input_messages:
+             # map stored chat roles to the matching ChatMessage constructors
+             if message["role"] == "user":
+                 messages.append(ChatMessage.from_user(message["content"]))
+             else:
+                 messages.append(ChatMessage.from_assistant(message["content"]))
+         messages.append(ChatMessage.from_user("""
+         Relevant information from the uploaded CV:
+         {% for doc in documents %}
+             {{ doc.content }}
+         {% endfor %}
+
+         Question: {{query}}
+         Answer:
+         """))
+         res = self.inference_pipeline.run(data={"text_embedder": {"text": query},
+                                                 "prompt_builder": {"prompt_source": messages,
+                                                                    "query": query
+                                                                    }})
+         return res["llm"]["replies"][0].content
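
`DocumentQAEngine` can also be exercised outside Streamlit. A minimal sketch, assuming the `local LLM` server from the README is running and a `cv.pdf` exists in the working directory (both are assumptions):

```python
# Illustrative standalone usage of the engine defined above.
from document_qa_engine import DocumentQAEngine

engine = DocumentQAEngine("local LLM")  # the local mode needs no real API key
with open("cv.pdf", "rb") as pdf:  # hypothetical sample CV
    engine.ingest_pdf(pdf)
print(engine.inference("What is the candidate's most recent role?", []))
```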
requirements.txt ADDED
@@ -0,0 +1,18 @@
+ # Streamlit
+ streamlit~=1.32.2
+ streamlit-modal==0.1.2
+ streamlit-authenticator==0.3.2
+ streamlit-pdf-viewer==0.0.9
+
+ # LLM
+ haystack-ai~=2.0.0
+ sentence_transformers~=2.6.0
+
+ # Utils
+ pandas~=2.2.1
+ pypdf~=4.2.0
+ pytest~=8.1.1
+ python-dotenv~=1.0.1
+
+ # Dev Utils
+ watchdog
resources/ml_logo.png ADDED
resources/puma.png ADDED
utils.py ADDED
@@ -0,0 +1,56 @@
+ from document_qa_engine import DocumentQAEngine
+
+ import streamlit as st
+
+ import logging
+ from yaml import load, SafeLoader, YAMLError
+
+
+ def load_authenticator_config(file_path='authenticator_config.yaml'):
+     try:
+         with open(file_path, 'r') as file:
+             authenticator_config = load(file, Loader=SafeLoader)
+             return authenticator_config
+     except FileNotFoundError:
+         logging.error(f"File {file_path} not found.")
+     except YAMLError as error:
+         logging.error(f"Error parsing YAML file: {error}")
+
+
+ def new_file():
+     st.session_state['loaded_embeddings'] = None
+     st.session_state['doc_id'] = None
+     st.session_state['uploaded'] = True
+     clear_memory()
+
+
+ def clear_memory():
+     if st.session_state['memory']:
+         st.session_state['memory'].clear()
+
+
+ def init_qa(model, api_key=None):
+     print(f"Initializing QA with model: {model} and API key: {api_key}")
+     return DocumentQAEngine(model, api_key=api_key)
+
+
+ def append_header():
+     st.header('📄 Document Insights :rainbow[AI] Assistant 📚', divider='rainbow')
+     st.text("📥 Upload documents in PDF format. Get insights... ask questions...")
+
+
+ def append_documentation_to_sidebar():
+     with st.expander("Disclaimer"):
+         st.markdown(
+             """
+             :warning: Do not upload sensitive data. We **temporarily** store text from the uploaded PDF documents
+             solely for the purpose of processing your request, and we **do not assume responsibility** for any
+             subsequent use or handling of the data submitted to third-party LLMs.
+             """)
+     with st.expander("Documentation"):
+         st.markdown(
+             """
+             Upload a CV as a PDF document. Once the spinner stops, you can proceed to ask your questions. The answers
+             will be displayed in the right column. The system will answer your questions using the content of the
+             document and mark references in the PDF viewer.
+             """)
utils/__pycache__/config.cpython-38.pyc ADDED
Binary file (1.47 kB).
 
utils/__pycache__/haystack.cpython-38.pyc ADDED
Binary file (3.59 kB).
 
utils/__pycache__/ui.cpython-38.pyc ADDED
Binary file (733 Bytes).
 
utils/check_pydantic_version.py ADDED
@@ -0,0 +1,26 @@
+ import pydantic
+ import os
+ import fileinput
+
+ def replace_string_in_files(folder_path, old_str, new_str):
+     for subdir, dirs, files in os.walk(folder_path):
+         for file in files:
+             file_path = os.path.join(subdir, file)
+
+             # Check if the file is a text file (you can modify this condition based on your needs)
+             if file.endswith(".txt") or file.endswith(".py"):
+                 # Open the file in place for editing
+                 with fileinput.FileInput(file_path, inplace=True) as f:
+                     for line in f:
+                         # Replace the old string with the new string
+                         print(line.replace(old_str, new_str), end='')
+
+
+ def use_pydantic_v1():
+     module_file_path = pydantic.__file__
+     module_file_path = module_file_path.split('pydantic')[0] + 'haystack'
+     with open(module_file_path + '/schema.py', 'r') as f:
+         haystack_schema_file = f.read()
+
+     if 'from pydantic.v1' not in haystack_schema_file:
+         replace_string_in_files(module_file_path, 'from pydantic', 'from pydantic.v1')
utils/config.py ADDED
@@ -0,0 +1,43 @@
+ import argparse
+ import os
+
+ from dotenv import load_dotenv
+
+ load_dotenv()
+ parser = argparse.ArgumentParser(description='Document Insights: Extractive & Generative Methods')
+
+ document_store_choices = ('inmemory', 'weaviate', 'milvus', 'opensearch')
+ parser.add_argument('--store', choices=document_store_choices, default='inmemory',
+                     help='DocumentStore selection (default: %(default)s)')
+ parser.add_argument('--name', default="Document Insights: Extractive & Generative Methods")
+
+ model_configs = {
+     'EMBEDDING_MODEL': os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L12-v2"),
+     'GENERATIVE_MODEL': os.getenv("GENERATIVE_MODEL", "gpt-4"),
+     # 'EXTRACTIVE_MODEL': os.getenv("EXTRACTIVE_MODEL", "deepset/roberta-base-squad2"),
+     'EXTRACTIVE_MODEL': os.getenv("EXTRACTIVE_MODEL", "deepset/gelectra-large-germanquad"),
+     # 'EXTRACTIVE_MODEL': os.getenv("EXTRACTIVE_MODEL", "MachineLearningReply/bert-base-german-legal-qa"),
+     'OPENAI_KEY': os.getenv("OPENAI_KEY"),
+     'COHERE_KEY': os.getenv("COHERE_KEY"),
+ }
+
+ document_store_configs = {
+     # Weaviate Config
+     'WEAVIATE_HOST': os.getenv("WEAVIATE_HOST", "http://localhost"),
+     'WEAVIATE_PORT': os.getenv("WEAVIATE_PORT", 8080),
+     'WEAVIATE_INDEX': os.getenv("WEAVIATE_INDEX", "Document"),
+     'WEAVIATE_EMBEDDING_DIM': os.getenv("WEAVIATE_EMBEDDING_DIM", 768),
+
+     # OpenSearch Config
+     'OPENSEARCH_SCHEME': os.getenv("OPENSEARCH_SCHEME", "https"),
+     'OPENSEARCH_USERNAME': os.getenv("OPENSEARCH_USERNAME", "admin"),
+     'OPENSEARCH_PASSWORD': os.getenv("OPENSEARCH_PASSWORD", "admin"),
+     'OPENSEARCH_HOST': os.getenv("OPENSEARCH_HOST", "localhost"),
+     'OPENSEARCH_PORT': os.getenv("OPENSEARCH_PORT", 9200),
+     'OPENSEARCH_INDEX': os.getenv("OPENSEARCH_INDEX", "document"),
+     'OPENSEARCH_EMBEDDING_DIM': os.getenv("OPENSEARCH_EMBEDDING_DIM", 768),
+
+     # Milvus Config
+     'MILVUS_URI': os.getenv("MILVUS_URI", "http://localhost:19530/default"),
+     'MILVUS_INDEX': os.getenv("MILVUS_INDEX", "document"),
+     'MILVUS_EMBEDDING_DIM': os.getenv("MILVUS_EMBEDDING_DIM", 768),
+ }
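
Every default above can be overridden through the `.env` file read by `load_dotenv()`: the call populates `os.environ`, and `os.getenv` only falls back to its second argument when the variable is absent. A minimal illustration (the `.env` value is a placeholder):

```python
# If .env contains e.g. EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2,
# the printed value is the override; otherwise the default below is used.
import os
from dotenv import load_dotenv

load_dotenv()
print(os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L12-v2"))
```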
utils/haystack.py ADDED
@@ -0,0 +1,124 @@
+ import streamlit as st
+
+ from utils.config import document_store_configs, model_configs
+ from haystack import Pipeline
+ from haystack.schema import Answer
+ from haystack.document_stores import BaseDocumentStore
+ from haystack.document_stores import InMemoryDocumentStore, OpenSearchDocumentStore, WeaviateDocumentStore
+ from haystack.nodes import EmbeddingRetriever, FARMReader, PromptNode, PreProcessor
+ # from haystack.nodes import TextConverter, FileTypeClassifier, PDFToTextConverter
+ from milvus_haystack import MilvusDocumentStore
+
+ # Use this file to set up your Haystack pipeline and querying
+
+
+ @st.cache_resource(show_spinner=False)
+ def start_preprocessor_node():
+     print('initializing preprocessor node')
+     processor = PreProcessor(
+         clean_empty_lines=True,
+         clean_whitespace=True,
+         clean_header_footer=True,
+         # remove_substrings=None,
+         split_by="word",
+         split_length=100,
+         split_respect_sentence_boundary=True,
+         # split_overlap=0,
+         # max_chars_check=10_000
+     )
+     return processor
+
+
+ @st.cache_resource(show_spinner=False)
+ def start_document_store(type: str):
+     # This function starts the document store of your choice based on your command line preference
+     print('initializing document store')
+     if type == 'inmemory':
+         document_store = InMemoryDocumentStore(use_bm25=True, embedding_dim=384)
+         '''
+         documents = [
+             {
+                 'content': "Pi is a super dog",
+                 'meta': {'name': "pi.txt"}
+             },
+             {
+                 'content': "The revenue of Siemens is 5 million Euro",
+                 'meta': {'name': "siemens.txt"}
+             },
+         ]
+         document_store.write_documents(documents)
+         '''
+     elif type == 'opensearch':
+         document_store = OpenSearchDocumentStore(scheme=document_store_configs['OPENSEARCH_SCHEME'],
+                                                  username=document_store_configs['OPENSEARCH_USERNAME'],
+                                                  password=document_store_configs['OPENSEARCH_PASSWORD'],
+                                                  host=document_store_configs['OPENSEARCH_HOST'],
+                                                  port=document_store_configs['OPENSEARCH_PORT'],
+                                                  index=document_store_configs['OPENSEARCH_INDEX'],
+                                                  embedding_dim=document_store_configs['OPENSEARCH_EMBEDDING_DIM'])
+     elif type == 'weaviate':
+         document_store = WeaviateDocumentStore(host=document_store_configs['WEAVIATE_HOST'],
+                                                port=document_store_configs['WEAVIATE_PORT'],
+                                                index=document_store_configs['WEAVIATE_INDEX'],
+                                                embedding_dim=document_store_configs['WEAVIATE_EMBEDDING_DIM'])
+     elif type == 'milvus':
+         document_store = MilvusDocumentStore(uri=document_store_configs['MILVUS_URI'],
+                                              index=document_store_configs['MILVUS_INDEX'],
+                                              embedding_dim=document_store_configs['MILVUS_EMBEDDING_DIM'],
+                                              return_embedding=True)
+     return document_store
+
+
+ # cached to make index and models load only at start
+ @st.cache_resource(show_spinner=False)
+ def start_retriever(_document_store: BaseDocumentStore):
+     print('initializing retriever')
+     retriever = EmbeddingRetriever(document_store=_document_store,
+                                    embedding_model=model_configs['EMBEDDING_MODEL'],
+                                    top_k=5)
+     # _document_store.update_embeddings(retriever)
+     return retriever
+
+
+ @st.cache_resource(show_spinner=False)
+ def start_reader():
+     print('initializing reader')
+     reader = FARMReader(model_name_or_path=model_configs['EXTRACTIVE_MODEL'])
+     return reader
+
+
+ # cached to make index and models load only at start
+ @st.cache_resource(show_spinner=False)
+ def start_haystack_extractive(_document_store: BaseDocumentStore, _retriever: EmbeddingRetriever, _reader: FARMReader):
+     print('initializing pipeline')
+     pipe = Pipeline()
+     pipe.add_node(component=_retriever, name="Retriever", inputs=["Query"])
+     pipe.add_node(component=_reader, name="Reader", inputs=["Retriever"])
+     return pipe
+
+
+ @st.cache_resource(show_spinner=False)
+ def start_haystack_rag(_document_store: BaseDocumentStore, _retriever: EmbeddingRetriever, openai_key):
+     prompt_node = PromptNode(default_prompt_template="deepset/question-answering",
+                              model_name_or_path=model_configs['GENERATIVE_MODEL'],
+                              api_key=openai_key,
+                              max_length=500)
+     pipe = Pipeline()
+
+     pipe.add_node(component=_retriever, name="Retriever", inputs=["Query"])
+     pipe.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])
+
+     return pipe
+
+
+ # @st.cache_data(show_spinner=True)
+ def query(_pipeline, question):
+     params = {}
+     results = _pipeline.run(question, params=params)
+     return results
+
+
+ def initialize_pipeline(task, document_store, retriever, reader, openai_key=""):
+     if task == 'extractive':
+         return start_haystack_extractive(document_store, retriever, reader)
+     elif task == 'rag':
+         return start_haystack_rag(document_store, retriever, openai_key)
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import streamlit as st
2
+
3
+ def set_state_if_absent(key, value):
4
+ if key not in st.session_state:
5
+ st.session_state[key] = value
6
+
7
+ def set_initial_state():
8
+ set_state_if_absent("question", "Ask something here?")
9
+ set_state_if_absent("results_extractive", None)
10
+ set_state_if_absent("results_generative", None)
11
+ set_state_if_absent("task", None)
12
+
13
+ def reset_results(*args):
14
+ st.session_state.results_extractive = None
15
+ st.session_state.results_generative = None
16
+ st.session_state.task = None