Space: pkiage/credit_risk_modeling_demo

Paul Kiage committed on Mar 6, 2023

Commit 7d861ad • 1 Parent(s): 9af3e2b

Hugging Face Deployment Setup (#11)

* refactor for hugging face space deployment

* docs: HF space branch

Files changed (42)

.github/workflows/check_file_size.yml +16 -0
.github/workflows/sync_to_hf_hub.yml +20 -0
Dockerfile +20 -0
Procfile +0 -1
README.md +22 -58
src/app.py → app.py +27 -29
{src → common}/__init__.py +0 -0
src/features/util_build_features.py → common/data.py +2 -93
common/util.py +391 -0
common/views.py +361 -0
src/features/build_features.py → data_setup.py +42 -15
requirements.txt +0 -0
setup.py +0 -10
setup.sh +0 -13
src/__main__.py +0 -0
src/models/__init__.py +0 -0
src/models/logistic_model.py +0 -33
src/models/logistic_predict_model.py +0 -4
src/models/logistic_test_model.py +0 -4
src/models/logistic_train_model.py +0 -69
src/models/util_predict_model.py +0 -87
src/models/util_predict_model_threshold.py +0 -310
src/models/xgboost_model.py +0 -33
src/models/xgboost_predict_model.py +0 -4
src/models/xgboost_test_model.py +0 -4
src/models/xgboost_train_model.py +0 -68
src/visualization/__init__.py +0 -0
src/visualization/graphs_decision_tree.py +0 -23
src/visualization/graphs_download.py +0 -17
src/visualization/graphs_logistic.py +0 -12
src/visualization/graphs_settings.py +0 -28
src/visualization/graphs_test.py +0 -78
src/visualization/graphs_threshold.py +0 -80
src/visualization/metrics.py +0 -132
{src/features → views}/__init__.py +0 -0
views/decision_tree.py +70 -0
src/models/util_test.py → views/evaluation.py +11 -169
views/logistic.py +119 -0
src/models/util_model_comparison.py → views/model_comparison.py +9 -14
src/models/util_strategy_table.py → views/strategy_table.py +4 -4
views/threshold.py +272 -0
src/models/util_model_class.py → views/typing.py +1 -1

.github/workflows/check_file_size.yml ADDED

	@@ -0,0 +1,16 @@

+name: Check file size
+on:               # or directly `on: [push]` to run the action on every push on any branch
+  pull_request:
+    branches: [main]
+  # to run this workflow manually from the Actions tab
+  workflow_dispatch:
+jobs:
+  sync-to-hub:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check large files
+        uses: ActionsDesk/lfs-warning@v2.0
+        with:
+          filesizelimit: 10485760 # this is 10MB so we can sync to HF Spaces
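
The `filesizelimit` value above is 10 * 1024 * 1024 bytes. To run a similar check locally before opening a pull request, a minimal sketch could look like the following. The function name `find_large_files` is hypothetical, and the strict `>` comparison is an assumption; the action's exact comparison is not shown in this diff.

```python
import os

# 10 MB expressed in bytes; matches the filesizelimit value in the workflow.
LIMIT_BYTES = 10 * 1024 * 1024


def find_large_files(root, limit=LIMIT_BYTES):
    """Return sorted (relative_path, size_in_bytes) pairs for files larger than limit."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's internal object store (assumption: only tracked files matter).
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit:
                large.append((os.path.relpath(path, root), size))
    return sorted(large)


if __name__ == "__main__":
    for rel_path, size in find_large_files("."):
        print(f"{rel_path}: {size} bytes")
```

Running the script from the repository root lists any file that would probably need Git LFS or removal before syncing.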

.github/workflows/sync_to_hf_hub.yml ADDED

	@@ -0,0 +1,20 @@

+name: Sync to Hugging Face hub
+on:
+  push:
+    branches: [main]
+  # to run this workflow manually from the Actions tab
+  workflow_dispatch:
+jobs:
+  sync-to-hub:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+          lfs: true
+      - name: Push to hub
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: git push https://pkiage:$HF_TOKEN@huggingface.co/spaces/pkiage/credit_risk_modeling_demo main

Dockerfile ADDED

	@@ -0,0 +1,20 @@

+# read the doc: https://huggingface.co/docs/hub/spaces-sdks-docker
+# you will also find guides on how best to write your Dockerfile
+FROM python:3.9
+RUN apt update
+RUN apt install -y graphviz
+WORKDIR /code
+COPY ./requirements.txt /code/requirements.txt
+RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
+COPY . .
+CMD ["streamlit", "run", "app.py", "--server.address", "0.0.0.0"]
+# CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "7860"]

Procfile DELETED

	@@ -1 +0,0 @@
-web: sh setup.sh && streamlit run src/app.py

README.md CHANGED

@@ -1,3 +1,14 @@
 # Credit Risk Modelling
 # About
@@ -72,68 +83,29 @@ pip install -r requirements.txt
 https://graphviz.org/download/
-## Build and install local package
-```shell
-python setup.py build
-```
-```shell
-python setup.py install
-```
 ### Run the streamlit app (app.py) by running the following in terminal (from repository root folder):
 ```shell
-streamlit run src/app.py
 ```
 ## Deployed setup details
-For faster model building and testing (particularly XGBoost) a local setup or on a more powerful server than free heroku dyno type is recommended. ([tutorials on servers for data science & ML](https://course.fast.ai))
-⚠️⚠️⚠️
-***UPDATE: In [Heroku’s Next Chapter](https://blog.heroku.com/next-chapter) free dynos will be removed starting [November 28, 2022](https://help.heroku.com/RSBRUH58/removal-of-heroku-free-product-plans-faq)***
-*[Hosting Streamlit app would require](https://discuss.streamlit.io/t/can-i-host-streamlit-on-now-sh-vercel/3189) a Platform as a service (PaaS) since [Streamlit apps aren't static thus can't run on static web host](https://discuss.streamlit.io/t/hosting-streamlit-on-github-pages/356/2).*
-*Viable alternatives include paid services such as AWS, Azure, GCP, DigitalOcean, Heroku, [Replit](https://replit.com/heroku) paid version (due to Repl Resources used) etc.*
-*Platforms such as Github Pages, Netifly, & Vercel currenty mostly require the app to [output a static website](https://answers.netlify.com/t/how-to-run-streamlit-hello-on-netlify/11899/2) since most of those services will not run Python ([or any server process](https://answers.netlify.com/t/support-guide-can-i-run-a-web-server-http-listener-and-or-database-at-netlify/3078)) at browse time. Netifly for instance is designed for the [Jamstack](https://jamstack.org/) that doesn't depend on a "web server". Vercel on the other hand requires either a [`handler` that inherits from the `BaseHTTPRequestHandler` class or an app that exposes a WSGI or ASGI Application](https://vercel.com/docs/runtimes#advanced-usage/advanced-python-usage) - [Tornado](https://www.tornadoweb.org/en/stable/index.html?highlight=wsgi#threads-and-wsgi) a [dependency of Streamlit](https://openbase.com/python/streamlit/dependencies) is [currently not compatible with WSGI](https://www.reddit.com/r/learnpython/comments/grmjfo/comment/fs4elmx/).*
-Currently hosted on [Streamlit Community Cloud](https://blog.streamlit.io/host-your-streamlit-app-for-free/)
-⚠️⚠️⚠️
-[Free Heroku dyno type](https://devcenter.heroku.com/articles/dyno-types) was used to deploy the app
-Memory (RAM): 512 MB
-CPU Share: 1x
-Compute: 1x-4x
-Dedicated: no
-Sleeps: yes
-[Enabled Autodeploy from Github](https://devcenter.heroku.com/articles/github-integration) if want to [manually deploy to Heroku](https://devcenter.heroku.com/articles/git#deploy-your-code) the steps are as follows:
-From main branch:
-```shell
-heroku login
-git push heroku main
-```
-From branch beside main:
 ```shell
-heroku login
-git push heroku branch_name:main
 ```
 # Roadmap
@@ -222,12 +194,4 @@ code2flow src/models/util_model_comparison.py -o docs/call-graph/util_model_comp
 [A Gentle Introduction to Threshold-Moving for Imbalanced Classification](https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/)
-- Selecting optimal threshold using Youden's J statistic
-[Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/)
-- Project structure
-[GraphViz Buildpack](https://github.com/weibeld/heroku-buildpack-graphviz)
-- Buildpack used for Heroku deployment

+---
+title: Credit Risk Modeling
+emoji: 📈
+colorFrom: indigo
+colorTo: blue
+sdk: docker
+app_port: 8501
+pinned: false
+license: openrail
+---
 # Credit Risk Modelling
 # About
 https://graphviz.org/download/
 ### Run the streamlit app (app.py) by running the following in terminal (from repository root folder):
 ```shell
+streamlit run app.py
 ```
 ## Deployed setup details
+**Hugging Face Space Deployment Tips**
+Initial Setup
+- [When creating the Spaces Configuration Reference](https://huggingface.co/docs/hub/spaces-config-reference) check logs to specify the [Docker Space](https://huggingface.co/docs/hub/spaces-sdks-docker) app_port based on build
+- In the Dockerfile, bind Streamlit to a server address, e.g. 0.0.0.0
+- [Install Graphviz on Debian](https://installati.one/debian/11/graphviz/) in a Docker Space rather than using a Streamlit Space, to solve the ```failed to execute posixpath('dot'), make sure the graphviz executables are on your systems' path``` error, since a Streamlit Space does not give access to a terminal
 ```shell
+git remote add space https://huggingface.co/spaces/pkiage/credit_risk_modeling_demo
+git push --force space main
 ```
+- [When syncing with Hugging Face via Github Actions](https://huggingface.co/docs/hub/spaces-github-actions) the [User Access Token](https://huggingface.co/docs/hub/security-tokens) created on Hugging Face (HF) should have write access
+- Run the Space from the main branch, since running from [other branches currently isn't supported](https://discuss.huggingface.co/t/is-it-possible-to-run-apps-off-of-non-main-branches-in-a-space/18086)
 # Roadmap
 [A Gentle Introduction to Threshold-Moving for Imbalanced Classification](https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/)
+- Selecting optimal threshold using Youden's J statistic
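
As an illustration of the Youden's J reference above, here is a dependency-free sketch of the selection rule. The repository itself uses scikit-learn's `roc_curve` with `argmax(tpr - fpr)`; the function name `best_threshold_youden_j` and the toy data are hypothetical. The sketch treats a score as a predicted default when it is greater than or equal to the candidate threshold, which is the convention assumed here for ROC points.

```python
def best_threshold_youden_j(y_true, prob_default):
    """Return (threshold, J) maximising Youden's J = TPR - FPR.

    Candidate thresholds are the distinct predicted probabilities,
    scanned from highest to lowest; ties keep the first (highest) one.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("y_true must contain both classes")

    best_threshold, best_j = None, float("-inf")
    for threshold in sorted(set(prob_default), reverse=True):
        # Count true and false positives when predicting default for p >= threshold.
        tp = sum(1 for y, p in zip(y_true, prob_default) if p >= threshold and y == 1)
        fp = sum(1 for y, p in zip(y_true, prob_default) if p >= threshold and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    p = [0.1, 0.4, 0.35, 0.8]
    print(best_threshold_youden_j(y, p))
```

This quadratic-time loop is meant for reading rather than for large datasets, where a sorted single pass (as in `roc_curve`) would be preferable.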

src/app.py → app.py RENAMED

@@ -1,24 +1,17 @@
-import streamlit as st
 from typing import OrderedDict
-from features.build_features import initialise_data
-from models.xgboost_model import xgboost_class
-from models.logistic_model import logistic_class
-from models.util_model_comparison import model_comparison_view
-from models.util_strategy_table import strategy_table_view
 def main():
-    st.write("Source code: https://github.com/pkiage/tool-credit-risk-modelling")
     currency_options = ["USD", "KES", "GBP"]
-    model_options = ["XGBoost", "Logistic"]
     currency = st.sidebar.selectbox(
         label="What currency will you be using?", options=currency_options
     )
@@ -31,25 +24,30 @@ def main():
     st.title("Modelling")
     models_selected_list = st.sidebar.multiselect(
         label="Select model", options=model_options, default=model_options
     )
     models_selected_set = set(models_selected_list)
-    model_classes = OrderedDict()
-    if "Logistic" in models_selected_set:
-        logistic_model_class = logistic_class(split_dataset, currency)
-        model_classes["Logistic"] = logistic_model_class
-    if "XGBoost" in models_selected_set:
-        xgboost_model_class = xgboost_class(split_dataset, currency)
-        model_classes["XGBoost"] = xgboost_model_class
-    model_comparison_view(split_dataset, model_classes)
-    strategy_table_view(currency, model_classes)
 if __name__ == "__main__":

 from typing import OrderedDict
+import streamlit as st
+from data_setup import initialise_data
+from views.decision_tree import decisiontree_view
+from views.logistic import logistic_view
+from views.model_comparison import model_comparison_view
+from views.strategy_table import strategy_table_view
+import os
+os.environ["PATH"] += os.pathsep + r'C:\Program Files (x86)\Graphviz0.19.1/bin/'
 def main():
     currency_options = ["USD", "KES", "GBP"]
     currency = st.sidebar.selectbox(
         label="What currency will you be using?", options=currency_options
     )
     st.title("Modelling")
+    model_options = ["Logistic Regression", "Decision Trees"]
+    # Returns list
     models_selected_list = st.sidebar.multiselect(
         label="Select model", options=model_options, default=model_options
     )
     models_selected_set = set(models_selected_list)
+    model_views = OrderedDict()
+    if "Logistic Regression" in models_selected_set:
+        logistic_model_view = logistic_view(split_dataset, currency)
+        model_views["Logistic Regression"] = logistic_model_view
+    if "Decision Trees" in models_selected_set:
+        decision_tree_model_view = decisiontree_view(split_dataset, currency)
+        model_views["Decision Trees"] = decision_tree_model_view
+    if models_selected_list:
+        model_comparison_view(
+            split_dataset,
+            model_views,
+        )
+        strategy_table_view(currency, model_views)
 if __name__ == "__main__":
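
The PATH line in app.py appends a Windows-specific Graphviz directory unconditionally. One possible alternative, sketched here under the assumption that the `dot` executable is what the plotting code needs, is to extend PATH only when `dot` cannot already be found. The helper name `ensure_graphviz_on_path` is hypothetical and is not part of this commit.

```python
import os
import shutil


def ensure_graphviz_on_path(candidate_dirs):
    """Append the first existing candidate directory to PATH if `dot` is missing.

    Returns True when `dot` can be resolved afterwards, False otherwise.
    """
    if shutil.which("dot") is None:
        for directory in candidate_dirs:
            if os.path.isdir(directory):
                os.environ["PATH"] = (
                    os.environ.get("PATH", "") + os.pathsep + directory
                )
                break
    return shutil.which("dot") is not None


if __name__ == "__main__":
    # Example candidate taken from app.py; it only takes effect on a machine
    # where that directory actually exists.
    found = ensure_graphviz_on_path(
        [r"C:\Program Files (x86)\Graphviz0.19.1/bin/"]
    )
    print("Graphviz dot available:", found)
```

In the Docker image built from the Dockerfile above, `apt install -y graphviz` is expected to put `dot` on PATH already, so such a helper would likely leave the environment untouched there.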

{src → common}/__init__.py RENAMED

File without changes

src/features/util_build_features.py → common/data.py RENAMED

@@ -1,13 +1,10 @@
-import streamlit as st
 from typing import List, Union, cast
 from dataclasses import dataclass
 from sklearn.model_selection import train_test_split
 import pandas as pd
 @dataclass
 class SplitDataset:
@@ -95,91 +92,3 @@ class Dataset:
             y_train=cast(pd.Series, y_train),
             y_test=cast(pd.Series, y_test),
         )
-def drop_columns(df, columns):
-    return df.drop(columns, axis=1)
-def remove_less_than_0_columns(df, column):
-    df[column].dropna()
-    return df.loc[(df[column] != 0).any(1)]
-def boolean_int_condition_label(df, label_column_name, condition):
-    df[label_column_name] = condition
-    y = df[label_column_name].astype(int)
-    df = drop_columns(df, label_column_name)
-    return y, df
-@st.cache(suppress_st_warning=True)
-def undersample_training_data(
-    df: pd.DataFrame, column_name: str, split_dataset
-):
-    count_nondefault, count_default = split_dataset.X_y_train[
-        column_name
-    ].value_counts()
-    nondefaults = df[df[column_name] == 0]  # 0
-    defaults = df[df[column_name] == 1]
-    under_sample = min(count_nondefault, count_default)
-    nondefaults_under = nondefaults.sample(under_sample)
-    defaults_under = defaults.sample(under_sample)
-    X_y_train_under = pd.concat(
-        [
-            nondefaults_under.reset_index(drop=True),
-            defaults_under.reset_index(drop=True),
-        ],
-        axis=0,
-    )
-    X_train_under = X_y_train_under.drop([column_name], axis=1)  # remove label
-    y_train_under = X_y_train_under[column_name]  # label only
-    class_balance_default = X_y_train_under[column_name].value_counts()
-    return [
-        X_train_under,
-        y_train_under,
-        X_y_train_under,
-        class_balance_default,
-    ]
-def select_predictors(dataset):
-    st.header("Predictors")
-    possible_columns = dataset.x_values_column_names
-    selected_columns = st.sidebar.multiselect(
-        label="Select Predictors",
-        options=possible_columns,
-        default=possible_columns,
-    )
-    return dataset.x_values_filtered_columns(selected_columns)
-def import_data():
-    if "input_data_frame" not in st.session_state:
-        st.session_state.input_data_frame = pd.read_csv(
-            r"./data/processed/cr_loan_w2.csv"
-        )
-    if "dataset" not in st.session_state:
-        df = cast(pd.DataFrame, st.session_state.input_data_frame)
-        dataset = Dataset(
-            df=df,
-            random_state=123235,
-            test_size=40,
-        )
-        st.session_state.dataset = dataset
-    else:
-        dataset = st.session_state.dataset
-    return dataset

 from typing import List, Union, cast
 from dataclasses import dataclass
 from sklearn.model_selection import train_test_split
 import pandas as pd
+from common.util import drop_columns
 @dataclass
 class SplitDataset:
             y_train=cast(pd.Series, y_train),
             y_test=cast(pd.Series, y_test),
         )

common/util.py ADDED

	@@ -0,0 +1,391 @@

+# DATA MANIPULATION & ANALYSIS
+import pickle
+import streamlit as st
+# Arrays
+import numpy as np
+# DataFrames and Series
+import pandas as pd
+# Returns the indices of the maximum values along an axis
+from numpy import argmax
+# MODELLING
+# Logistic regression
+from sklearn.linear_model import LogisticRegression
+from sklearn.model_selection import StratifiedKFold
+# XGBoosted Decision Trees
+import xgboost as xgb
+# REPORTING, EVALUATION, AND INTERPRETATION
+# Classification report
+from sklearn.metrics import classification_report
+# Receiver Operating Characteristic (ROC) curve
+from sklearn.metrics import roc_curve
+# Evaluate a score by cross-validation
+from sklearn.model_selection import cross_val_score
+# # Functions
+def drop_columns(df, columns):
+    return df.drop(columns, axis=1)
+def remove_less_than_0_columns(df, column):
+    df[column].dropna()
+    return df.loc[(df[column] != 0).any(1)]
+def boolean_int_condition_label(df, label_column_name, condition):
+    df[label_column_name] = condition
+    y = df[label_column_name].astype(int)
+    df = drop_columns(df, label_column_name)
+    return y, df
+@st.cache(suppress_st_warning=True)
+def undersample_training_data(
+    df: pd.DataFrame, column_name: str, split_dataset
+):
+    count_nondefault, count_default = split_dataset.X_y_train[
+        column_name
+    ].value_counts()
+    nondefaults = df[df[column_name] == 0]  # 0
+    defaults = df[df[column_name] == 1]
+    under_sample = min(count_nondefault, count_default)
+    nondefaults_under = nondefaults.sample(under_sample)
+    defaults_under = defaults.sample(under_sample)
+    X_y_train_under = pd.concat(
+        [
+            nondefaults_under.reset_index(drop=True),
+            defaults_under.reset_index(drop=True),
+        ],
+        axis=0,
+    )
+    X_train_under = X_y_train_under.drop([column_name], axis=1)  # remove label
+    y_train_under = X_y_train_under[column_name]  # label only
+    class_balance_default = X_y_train_under[column_name].value_counts()
+    return [
+        X_train_under,
+        y_train_under,
+        X_y_train_under,
+        class_balance_default,
+    ]
+def create_coeffient_feature_dictionary_logistic_model(
+    logistic_model, training_data
+):
+    return {
+        feat: coef
+        for coef, feat in zip(
+            logistic_model.coef_[0, :], training_data.columns
+        )
+    }
+@st.cache(suppress_st_warning=True)
+def test_variables_logistic(X_train, y_train):
+    # Create and fit the logistic regression model
+    return LogisticRegression(solver="lbfgs").fit(X_train, np.ravel(y_train))
+@st.cache(suppress_st_warning=True)
+def print_coeff_logistic(clf_logistic_model, split_dataset):
+    # Dictionary of features and their coefficients
+    return create_coeffient_feature_dictionary_logistic_model(
+        clf_logistic_model, split_dataset.X_train
+    )
+@st.cache(suppress_st_warning=True, hash_funcs={
+    xgb.XGBClassifier: pickle.dumps
+})
+def test_variables_gbt(X_train, y_train):
+    # Using hyperparameters learning_rate and max_depth
+    return xgb.XGBClassifier(
+        learning_rate=0.1,
+        max_depth=7,
+        use_label_encoder=False,
+        eval_metric="logloss",
+    ).fit(X_train, np.ravel(y_train), eval_metric="logloss")
+# In[398]:
+def get_df_trueStatus_probabilityDefault_threshStatus_loanAmount(
+    model, X, y, threshold, loan_amount_col_name
+):
+    true_status = y.to_frame()
+    loan_amount = X[loan_amount_col_name]
+    clf_prediction_prob = model.predict_proba(np.ascontiguousarray(X))
+    clf_prediction_prob_df = pd.DataFrame(
+        clf_prediction_prob[:, 1], columns=["PROB_DEFAULT"]
+    )
+    clf_thresh_predicted_default_status = (
+        clf_prediction_prob_df["PROB_DEFAULT"]
+        .apply(lambda x: 1 if x > threshold else 0)
+        .rename("PREDICT_DEFAULT_STATUS")
+    )
+    return pd.concat(
+        [
+            true_status.reset_index(drop=True),
+            clf_prediction_prob_df.reset_index(drop=True),
+            clf_thresh_predicted_default_status.reset_index(drop=True),
+            loan_amount.reset_index(drop=True),
+        ],
+        axis=1,
+    )
+def find_best_threshold_J_statistic(y, clf_prediction_prob_df):
+    fpr, tpr, thresholds = roc_curve(y, clf_prediction_prob_df)
+    # get the best threshold
+    # Youden’s J statistic tpr-fpr
+    # Argmax to get the index in
+    # thresholds
+    return thresholds[argmax(tpr - fpr)]
+# In[399]:
+# Function that makes dataframe with probability of default, predicted default status based on threshold
+# and actual default status
+def model_probability_values_df(model, X):
+    return pd.DataFrame(model.predict_proba(X)[:, 1], columns=["PROB_DEFAULT"])
+def apply_threshold_to_probability_values(probability_values, threshold):
+    return (
+        probability_values["PROB_DEFAULT"]
+        .apply(lambda x: 1 if x > threshold else 0)
+        .rename("PREDICT_DEFAULT_STATUS")
+    )
+@st.cache(suppress_st_warning=True)
+def find_best_threshold_J_statistic(y, clf_prediction_prob_df):
+    fpr, tpr, thresholds = roc_curve(y, clf_prediction_prob_df)
+    # get the best threshold
+    J = tpr - fpr  # Youden’s J statistic
+    ix = argmax(J)
+    return thresholds[ix]
+# In[401]:
+def create_cross_validation_df(
+    X, y, eval_metric, seed, trees, n_folds, early_stopping_rounds
+):
+    # Test data x and y
+    DTrain = xgb.DMatrix(X, label=y)
+    # auc or logloss
+    params = {
+        "eval_metric": eval_metric,
+        "objective": "binary:logistic",  # logistic say 0 or 1 for loan status
+        "seed": seed,
+    }
+    # Create the data frame of cross validations
+    cv_df = xgb.cv(
+        params,
+        DTrain,
+        num_boost_round=trees,
+        nfold=n_folds,
+        early_stopping_rounds=early_stopping_rounds,
+        shuffle=True,
+    )
+    return [DTrain, cv_df]
+# In[450]:
+def cross_validation_scores(model, X, y, nfold, score, seed):
+    # return cv scores of metric
+    return cross_val_score(
+        model,
+        np.ascontiguousarray(X),
+        np.ravel(np.ascontiguousarray(y)),
+        cv=StratifiedKFold(n_splits=nfold, shuffle=True, random_state=seed),
+        scoring=score,
+    )
+def default_status_per_threshold(threshold_list, prob_default):
+    threshold_default_status_list = []
+    for threshold in threshold_list:
+        threshold_default_status = prob_default.apply(
+            lambda x: 1 if x > threshold else 0
+        )
+        threshold_default_status_list.append(threshold_default_status)
+    return threshold_default_status_list
+def classification_report_per_threshold(
+    threshold_list, threshold_default_status_list, y_test
+):
+    target_names = ["Non-Default", "Default"]
+    classification_report_list = []
+    for threshold_default_status in threshold_default_status_list:
+        thresh_classification_report = classification_report(
+            y_test,
+            threshold_default_status,
+            target_names=target_names,
+            output_dict=True,
+            zero_division=0,
+        )
+        classification_report_list.append(thresh_classification_report)
+    # Return threshold classification report dict
+    return dict(zip(threshold_list, classification_report_list))
+def thresh_classification_report_recall_accuracy(
+    thresh_classification_report_dict,
+):
+    thresh_def_recalls_list = []
+    thresh_nondef_recalls_list = []
+    thresh_accs_list = []
+    for x in [*thresh_classification_report_dict]:
+        thresh_def_recall = thresh_classification_report_dict[x]["Default"][
+            "recall"
+        ]
+        thresh_def_recalls_list.append(thresh_def_recall)
+        thresh_nondef_recall = thresh_classification_report_dict[x][
+            "Non-Default"
+        ]["recall"]
+        thresh_nondef_recalls_list.append(thresh_nondef_recall)
+        thresh_accs = thresh_classification_report_dict[x]["accuracy"]
+        thresh_accs_list.append(thresh_accs)
+    return [
+        thresh_def_recalls_list,
+        thresh_nondef_recalls_list,
+        thresh_accs_list,
+    ]
+def create_accept_rate_list(start, end, samples):
+    return np.linspace(start, end, samples, endpoint=True)
+def create_strategyTable_df(
+    start, end, samples, actual_probability_predicted_acc_rate, true, currency
+):
+    accept_rates = create_accept_rate_list(start, end, samples)
+    thresholds_strat = []
+    bad_rates_start = []
+    Avg_Loan_Amnt = actual_probability_predicted_acc_rate[true].mean()
+    num_accepted_loans_start = []
+    for rate in accept_rates:
+        # Calculate the threshold for the acceptance rate
+        thresh = np.quantile(
+            actual_probability_predicted_acc_rate["PROB_DEFAULT"], rate
+        ).round(3)
+        # Add the threshold value to the list of thresholds
+        thresholds_strat.append(
+            np.quantile(
+                actual_probability_predicted_acc_rate["PROB_DEFAULT"], rate
+            ).round(3)
+        )
+        # Reassign the loan_status value using the threshold
+        actual_probability_predicted_acc_rate[
+            "PREDICT_DEFAULT_STATUS"
+        ] = actual_probability_predicted_acc_rate["PROB_DEFAULT"].apply(
+            lambda x: 1 if x > thresh else 0
+        )
+        # Create a set of accepted loans using this acceptance rate
+        accepted_loans = actual_probability_predicted_acc_rate[
+            actual_probability_predicted_acc_rate["PREDICT_DEFAULT_STATUS"]
+            == 0
+        ]
+        # Calculate and append the bad rate using the acceptance rate
+        bad_rates_start.append(
+            np.sum((accepted_loans[true]) / len(accepted_loans[true])).round(3)
+        )
+        # Accepted loans
+        num_accepted_loans_start.append(len(accepted_loans))
+    # Calculate estimated value
+    money_accepted_loans = [
+        accepted_loans * Avg_Loan_Amnt
+        for accepted_loans in num_accepted_loans_start
+    ]
+    money_bad_accepted_loans = [
+        2 * money_accepted_loan * bad_rate
+        for money_accepted_loan, bad_rate in zip(
+            money_accepted_loans, bad_rates_start
+        )
+    ]
+    zip_object = zip(money_accepted_loans, money_bad_accepted_loans)
+    estimated_value = [
+        money_accepted_loan - money_bad_accepted_loan
+        for money_accepted_loan, money_bad_accepted_loan in zip_object
+    ]
+    accept_rates = ["{:.2f}".format(elem) for elem in accept_rates]
+    thresholds_strat = ["{:.2f}".format(elem) for elem in thresholds_strat]
+    bad_rates_start = ["{:.2f}".format(elem) for elem in bad_rates_start]
+    estimated_value = ["{:.2f}".format(elem) for elem in estimated_value]
+    return (
+        pd.DataFrame(
+            zip(
+                accept_rates,
+                thresholds_strat,
+                bad_rates_start,
+                num_accepted_loans_start,
+                estimated_value,
+            ),
+            columns=[
+                "Acceptance Rate",
+                "Threshold",
+                "Bad Rate",
+                "Num Accepted Loans",
+                f"Estimated Value ({currency})",
+            ],
+        )
+        .sort_values(by="Acceptance Rate", axis=0, ascending=False)
+        .reset_index(drop=True)
+    )
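
To make the arithmetic in `create_strategyTable_df` easier to follow, here is a dependency-free sketch of a single strategy-table row. It is an illustration rather than the committed implementation: the names `quantile_linear` and `strategy_row` are hypothetical, the average loan amount is passed in explicitly (an assumption of this sketch), and the rounding to three decimals is omitted. The quantile helper uses linear interpolation at position `q * (n - 1)`, which is assumed to match the default behaviour of `np.quantile`.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def strategy_row(prob_default, true_status, accept_rate, avg_loan_amount):
    """Compute threshold, bad rate, accepted count and estimated value for one rate."""
    # The threshold is the accept_rate quantile of the predicted default probabilities.
    threshold = quantile_linear(prob_default, accept_rate)
    # A loan is predicted to default when its probability exceeds the threshold,
    # so accepted loans are those at or below it.
    accepted = [y for p, y in zip(prob_default, true_status) if p <= threshold]
    # Bad rate: share of accepted loans that actually defaulted.
    bad_rate = sum(accepted) / len(accepted)
    money_accepted = len(accepted) * avg_loan_amount
    # Same formula as the diff: accepted money minus twice the money in bad loans.
    estimated_value = money_accepted - 2 * money_accepted * bad_rate
    return {
        "threshold": threshold,
        "bad_rate": bad_rate,
        "num_accepted": len(accepted),
        "estimated_value": estimated_value,
    }


if __name__ == "__main__":
    probs = [0.1, 0.2, 0.3, 0.4, 0.9]
    truth = [0, 0, 1, 0, 1]
    print(strategy_row(probs, truth, accept_rate=0.75, avg_loan_amount=1000))
```

With the toy data, accepting 75% of applicants keeps four loans, one of which defaults, so the bad rate is 0.25 and the estimated value is 4000 - 2 * 4000 * 0.25 = 2000.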

common/views.py ADDED

	@@ -0,0 +1,361 @@

+from typing import OrderedDict
+import streamlit as st  # works on command prompt
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+import xgboost as xgb
+from sklearn.metrics import (
+    roc_curve,
+)
+from sklearn.calibration import calibration_curve
+from xgboost import plot_tree
+from views.typing import ModelView
+def plot_logistic_coeff_barh(coef_dict, x, y):
+    fig = plt.figure(figsize=(x, y))
+    coef_dict_sorted = dict(
+        sorted(coef_dict.items(), key=lambda item: item[1], reverse=False)
+    )
+    plt.barh(*zip(*coef_dict_sorted.items()))
+    return fig
+def print_negative_coefficients_logistic_model(coef_dict):
+    # Equal to or less than 0
+    NegativeCoefficients = dict(
+        filter(lambda x: x[1] <= 0.0, coef_dict.items())
+    )
+    NegativeCoefficientsSorted = sorted(
+        NegativeCoefficients.items(), key=lambda x: x[1], reverse=False
+    )
+    text = (
+        "\n\nFeatures the model found to be negatively correlated with probability of default are:"
+        "\n{negative_features}:"
+    )
+    st.markdown(text.format(negative_features=NegativeCoefficientsSorted))
+    st.markdown(type(NegativeCoefficientsSorted))
+    st.markdown(NegativeCoefficients.items())
+def print_positive_coefficients_logistic_model(coef_dict):
+    # Equal to or greater than 0
+    PositiveCoefficients = dict(
+        filter(lambda x: x[1] >= 0.0, coef_dict.items())
+    )
+    PositiveCoefficientsSorted = sorted(
+        PositiveCoefficients.items(), key=lambda x: x[1], reverse=True
+    )
+    text = (
+        "\n\nFeatures the model found to be positively correlated with probability of default are:"
+        "\n{positive_features}:"
+    )
+    st.markdown(text.format(positive_features=PositiveCoefficientsSorted))
+def plot_importance_gbt(clf_gbt_model, barxsize, barysize):
+    axobject1 = xgb.plot_importance(clf_gbt_model, importance_type="weight")
+    fig1 = axobject1.figure
+    st.write("Feature Importance Plot (Gradient Boosted Tree)")
+    fig1.set_size_inches(barxsize, barysize)
+    return fig1
+def download_importance_gbt(fig1, barxsize, barysize):
+    if st.button(
+        "Download Feature Importance Plot as png (Gradient Boosted Tree)"
+    ):
+        dpisize = max(barxsize, barysize)
+        plt.savefig("bar.png", dpi=dpisize * 96, bbox_inches="tight")
+        fig1.set_size_inches(barxsize, barysize)
+def plot_tree_gbt(treexsize, treeysize, clf_gbt_model):
+    plot_tree(clf_gbt_model)
+    fig2 = plt.gcf()
+    fig2.set_size_inches(treexsize, treeysize)
+    return fig2
+def download_tree_gbt(treexsize, treeysize):
+    if st.button("Download Decision Tree Plot as png (Gradient Boosted Tree)"):
+        dpisize = max(treexsize, treeysize)
+        plt.savefig("tree.png", dpi=dpisize * 96, bbox_inches="tight")
+def cross_validation_graph(cv, eval_metric, trees):
+    # Plot the test AUC scores for each iteration
+    fig = plt.figure()
+    plt.plot(cv[cv.columns[2]])
+    plt.title(
+        "Test {eval_metric} Score Over {it_numbr} Iterations".format(
+            eval_metric=eval_metric, it_numbr=trees
+        )
+    )
+    plt.xlabel("Iteration Number")
+    plt.ylabel("Test {eval_metric} Score".format(eval_metric=eval_metric))
+    return fig
+def recall_accuracy_threshold_tradeoff_fig(
+    widthsize,
+    heightsize,
+    threshold_list,
+    thresh_def_recalls_list,
+    thresh_nondef_recalls_list,
+    thresh_accs_list,
+):
+    fig = plt.figure(figsize=(widthsize, heightsize))
+    plt.plot(threshold_list, thresh_def_recalls_list, label="Default Recall")
+    plt.plot(
+        threshold_list, thresh_nondef_recalls_list, label="Non-Default Recall"
+    )
+    plt.plot(threshold_list, thresh_accs_list, label="Model Accuracy")
+    plt.xlabel("Probability Threshold")
+    plt.ylabel("Score")
+    plt.xlim(0, 1)
+    plt.ylim(0, 1)
+    plt.legend()
+    plt.title("Recall and Accuracy Score Tradeoff with Probability Threshold")
+    plt.grid(False)
+    return fig
+def roc_auc_compare_n_models(y, model_views: OrderedDict[str, ModelView]):
+    colors = ["blue", "green"]
+    fig = plt.figure()
+    for color_idx, (model_name, model_view) in enumerate(model_views.items()):
+        fpr, tpr, _thresholds = roc_curve(
+            y, model_view.prediction_probability_df
+        )
+        plt.plot(fpr, tpr, color=colors[color_idx], label=f"{model_name}")
+    plt.plot([0, 1], [0, 1], linestyle="--", label="Random Prediction")
+    model_names = list(model_views.keys())
+    if not model_names:
+        model_name_str = "None"
+    elif len(model_names) == 1:
+        model_name_str = model_names[0]
+    else:
+        model_name_str = " and ".join(
+            [", ".join(model_names[:-1]), model_names[-1]]
+        )
+    plt.title(f"ROC Chart for {model_name_str} on the Probability of Default")
+    plt.xlabel("False Positive Rate (FP Rate)")
+    plt.ylabel("True Positive Rate (TP Rate)")
+    plt.legend()
+    plt.grid(False)
+    plt.xlim(0, 1)
+    plt.ylim(0, 1)
+    return fig
+def calibration_curve_report_commented_n(
+    y, model_views: OrderedDict[str, ModelView], bins: int
+):
+    fig = plt.figure()
+    for model_name, model_view in model_views.items():
+        frac_of_pos, mean_pred_val = calibration_curve(
+            y,
+            model_view.prediction_probability_df,
+            n_bins=bins,
+            normalize=True,
+        )
+        plt.plot(mean_pred_val, frac_of_pos, "s-", label=f"{model_name}")
+    # Create the calibration curve plot with the guideline
+    plt.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
+    plt.ylabel("Fraction of positives")
+    plt.xlabel("Average Predicted Probability")
+    plt.title("Calibration Curve")
+    plt.legend()
+    plt.grid(False)
+    plt.xlim(0, 1)
+    plt.ylim(0, 1)
+    return fig
+def acceptance_rate_threshold_fig(probability_default, acceptancerate, bins):
+    # Probability distribution
+    probability_stat_distribution = probability_default.describe()
+    # Acceptance rate threshold
+    acc_rate_thresh = np.quantile(probability_default, acceptancerate)
+    fig = plt.figure()
+    plt.hist(
+        probability_default,
+        color="blue",
+        bins=bins,
+        histtype="bar",
+        ec="white",
+    )
+    # Add a reference line to the plot for the threshold
+    plt.axvline(x=acc_rate_thresh, color="red")
+    plt.title("Acceptance Rate Threshold")
+    return (
+        fig,
+        probability_stat_distribution,
+        acc_rate_thresh,
+    )
+def streamlit_2columns_metrics_pct_df(
+    column1name_label: str,
+    column2name_label: str,
+    df: pd.DataFrame,
+):
+    (
+        column1name,
+        column2name,
+    ) = st.columns(2)
+    with column1name:
+        st.metric(
+            label=column1name_label,
+            value="{:.0%}".format(df.value_counts().get(1) / df.shape[0]),
+            delta=None,
+            delta_color="normal",
+        )
+    with column2name:
+        st.metric(
+            label=column2name_label,
+            value="{:.0%}".format(df.value_counts().get(0) / df.shape[0]),
+            delta=None,
+            delta_color="normal",
+        )
+def streamlit_2columns_metrics_df(
+    column1name_label: str,
+    column2name_label: str,
+    df: pd.DataFrame,
+):
+    (
+        column1name,
+        column2name,
+    ) = st.columns(2)
+    with column1name:
+        st.metric(
+            label=column1name_label,
+            value=df.value_counts().get(1),
+            delta=None,
+            delta_color="normal",
+        )
+    with column2name:
+        st.metric(
+            label=column2name_label,
+            value=df.value_counts().get(0),
+            delta=None,
+            delta_color="normal",
+        )
+def streamlit_2columns_metrics_df_shape(df: pd.DataFrame):
+    (
+        column1name,
+        column2name,
+    ) = st.columns(2)
+    with column1name:
+        st.metric(
+            label="Rows",
+            value=df.shape[0],
+            delta=None,
+            delta_color="normal",
+        )
+    with column2name:
+        st.metric(
+            label="Columns",
+            value=df.shape[1],
+            delta=None,
+            delta_color="normal",
+        )
+def streamlit_2columns_metrics_pct_series(
+    column1name_label: str,
+    column2name_label: str,
+    series: pd.Series,
+):
+    (
+        column1name,
+        column2name,
+    ) = st.columns(2)
+    with column1name:
+        st.metric(
+            label=column1name_label,
+            value="{:.0%}".format(series.get(1) / series.sum()),
+            delta=None,
+            delta_color="normal",
+        )
+    with column2name:
+        st.metric(
+            label=column2name_label,
+            value="{:.0%}".format(series.get(0) / series.sum()),
+            delta=None,
+            delta_color="normal",
+        )
+def streamlit_2columns_metrics_series(
+    column1name_label: str,
+    column2name_label: str,
+    series: pd.Series,
+):
+    (
+        column1name,
+        column2name,
+    ) = st.columns(2)
+    with column1name:
+        st.metric(
+            label=column1name_label,
+            value=series.get(1),
+            delta=None,
+            delta_color="normal",
+        )
+    with column2name:
+        st.metric(
+            label=column2name_label,
+            value=series.get(0),
+            delta=None,
+            delta_color="normal",
+        )
+def streamlit_chart_setting_height_width(
+    title: str,
+    default_widthvalue: int,
+    default_heightvalue: int,
+    widthkey: str,
+    heightkey: str,
+):
+    with st.expander(title):
+        lbarx_col, lbary_col = st.columns(2)
+        with lbarx_col:
+            width_size = st.number_input(
+                label="Width in inches:",
+                value=default_widthvalue,
+                key=widthkey,
+            )
+        with lbary_col:
+            height_size = st.number_input(
+                label="Height in inches:",
+                value=default_heightvalue,
+                key=heightkey,
+            )
+    return width_size, height_size
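
Note on `acceptance_rate_threshold_fig` above: the threshold is the `acceptancerate` quantile of the predicted default probabilities, so loans at or below it are the ones accepted. The following is a minimal pure-Python sketch of that idea, assuming linear interpolation between order statistics (the default behaviour of `np.quantile`). The names `quantile_threshold` and `accepted_share` are hypothetical and do not appear in the commit.

```python
from typing import Sequence


def quantile_threshold(probabilities: Sequence[float], acceptance_rate: float) -> float:
    """Return the acceptance_rate quantile, interpolating linearly (assumed method)."""
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be within [0, 1]")
    ordered = sorted(probabilities)
    # Fractional position of the requested quantile in the sorted values.
    position = acceptance_rate * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def accepted_share(probabilities: Sequence[float], threshold: float) -> float:
    """Share of loans whose predicted default probability is <= threshold."""
    return sum(p <= threshold for p in probabilities) / len(probabilities)
```

With five illustrative probabilities `[0.1, 0.2, 0.3, 0.4, 0.5]`, an acceptance rate of 0.5 lands on the median, and three of the five loans fall at or below that threshold.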

src/features/build_features.py → data_setup.py RENAMED

@@ -1,19 +1,13 @@
-from typing import List, Union, cast, Tuple
-from dataclasses import dataclass
-from sklearn.model_selection import train_test_split
-import pandas as pd
 import streamlit as st
-from  features.util_build_features import (
-    Dataset,
-    SplitDataset,
     undersample_training_data,
-    select_predictors,
-    import_data)
-from  visualization.metrics import (
     streamlit_2columns_metrics_df_shape,
     streamlit_2columns_metrics_series,
     streamlit_2columns_metrics_pct_series,
@@ -22,9 +16,22 @@ from  visualization.metrics import (
 )
 def initialise_data() -> Tuple[Dataset, SplitDataset]:
-    dataset = import_data()
     st.write(
         "Assuming data is already cleaned and relevant features (predictors) added."
@@ -34,12 +41,31 @@ def initialise_data() -> Tuple[Dataset, SplitDataset]:
         st.dataframe(dataset.df)
         streamlit_2columns_metrics_df_shape(dataset.df)
-    selected_x_values = select_predictors(dataset)
     with st.expander("Predictors Dataframe (X)"):
         st.dataframe(selected_x_values)
         streamlit_2columns_metrics_df_shape(selected_x_values)
     st.header("Split Testing and Training Data")
     test_size_slider_col, seed_col = st.columns(2)
@@ -62,6 +88,7 @@ def initialise_data() -> Tuple[Dataset, SplitDataset]:
     split_dataset = dataset.train_test_split(selected_x_values)
     true_status = split_dataset.y_test.to_frame().value_counts()
     st.sidebar.metric(

+from typing import Tuple, cast
+import pandas as pd
 import streamlit as st
+from common.data import Dataset, SplitDataset
+from common.util import (
     undersample_training_data,
+)
+from common.views import (
     streamlit_2columns_metrics_df_shape,
     streamlit_2columns_metrics_series,
     streamlit_2columns_metrics_pct_series,
 )
+# Initialize dataframe session state
 def initialise_data() -> Tuple[Dataset, SplitDataset]:
+    if "input_data_frame" not in st.session_state:
+        st.session_state.input_data_frame = pd.read_csv(
+            r"./data/processed/cr_loan_w2.csv"
+        )
+    if "dataset" not in st.session_state:
+        df = cast(pd.DataFrame, st.session_state.input_data_frame)
+        dataset = Dataset(
+            df=df,
+            random_state=123235,
+            test_size=40,
+        )
+        st.session_state.dataset = dataset
+    else:
+        dataset = st.session_state.dataset
     st.write(
         "Assuming data is already cleaned and relevant features (predictors) added."
         st.dataframe(dataset.df)
         streamlit_2columns_metrics_df_shape(dataset.df)
+    st.header("Predictors")
+    possible_columns = dataset.x_values_column_names
+    selected_columns = st.sidebar.multiselect(
+        label="Select Predictors",
+        options=possible_columns,
+        default=possible_columns,
+    )
+    selected_x_values = dataset.x_values_filtered_columns(selected_columns)
+    st.sidebar.metric(
+        label="# of Predictors Selected",
+        value=selected_x_values.shape[1],
+        delta=None,
+        delta_color="normal",
+    )
     with st.expander("Predictors Dataframe (X)"):
         st.dataframe(selected_x_values)
         streamlit_2columns_metrics_df_shape(selected_x_values)
+    # test_size is the share of data held out for testing (40% by default)
+    # A fixed random seed keeps the split reproducible
     st.header("Split Testing and Training Data")
     test_size_slider_col, seed_col = st.columns(2)
     split_dataset = dataset.train_test_split(selected_x_values)
+    # Series
     true_status = split_dataset.y_test.to_frame().value_counts()
     st.sidebar.metric(
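
Note on the `st.session_state` checks added to `initialise_data`: the pattern builds a value once and reuses it on later reruns. A minimal sketch of that pattern follows, with a plain `dict` standing in for `st.session_state` so it runs without Streamlit. The helper `get_or_create` and the loader `load_rows` are hypothetical names used only for illustration.

```python
from typing import Any, Callable, List, MutableMapping


def get_or_create(
    state: MutableMapping[str, Any], key: str, factory: Callable[[], Any]
) -> Any:
    """Return state[key], calling factory() only the first time the key is needed."""
    if key not in state:
        state[key] = factory()
    return state[key]


# A plain dict stands in for st.session_state in this sketch (assumption).
session_state: dict = {}
calls: List[str] = []


def load_rows() -> list:
    """Hypothetical stand-in for the pd.read_csv call in the commit."""
    calls.append("load")
    return [{"loan_status": 0}, {"loan_status": 1}]


first = get_or_create(session_state, "input_data_frame", load_rows)
second = get_or_create(session_state, "input_data_frame", load_rows)  # reuses the stored value
```

The second call returns the stored object without invoking the loader again, which is the effect the `if "..." not in st.session_state` guards aim for.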

requirements.txt CHANGED

Binary files a/requirements.txt and b/requirements.txt differ

setup.py DELETED

@@ -1,10 +0,0 @@
-from setuptools import find_packages, setup
-setup(
-    name='src',
-    packages=find_packages(),
-    version='0.1.0',
-    description='Tool for credit risk modelling',
-    author='Author',
-    license='MIT',
-)

setup.sh DELETED

@@ -1,13 +0,0 @@
-mkdir -p ~/.streamlit/
-cat << EOF > ~/.streamlit/credentials.toml
-[general]
-email = "paul.r.kiage@gmail.com"
-EOF
-cat << EOF > ~/.streamlit/config.toml
-[server]
-headless = true
-enableCORS = true
-port = $PORT
-EOF

src/__main__.py DELETED

File without changes

src/models/__init__.py DELETED

File without changes

src/models/logistic_model.py DELETED

@@ -1,33 +0,0 @@
-from  features.build_features import SplitDataset
-from  models.logistic_train_model import logistic_train_model
-from  models.logistic_predict_model import logistic_predict_model
-from  models.logistic_test_model import logistic_test_model
-from  models.util_model_class import ModelClass
-def logistic_class(split_dataset: SplitDataset, currency: str) -> ModelClass:
-    # Train Model
-    clf_logistic_model = logistic_train_model(split_dataset)
-    # Predict using Trained Model
-    clf_logistic_predictions = logistic_predict_model(
-        clf_logistic_model, split_dataset)
-    # Test and Evaluate Model
-    df_trueStatus_probabilityDefault_threshStatus_loanAmount_logistic = logistic_test_model(
-        clf_logistic_model,
-        split_dataset,
-        currency,
-        clf_logistic_predictions.probability_threshold_selected,
-        clf_logistic_predictions.predicted_default_status)
-    return ModelClass(
-        model=clf_logistic_model,
-        trueStatus_probabilityDefault_threshStatus_loanAmount_df=df_trueStatus_probabilityDefault_threshStatus_loanAmount_logistic,
-        probability_threshold_selected=clf_logistic_predictions.probability_threshold_selected,
-        predicted_default_status=clf_logistic_predictions.predicted_default_status,
-        prediction_probability_df=clf_logistic_predictions.prediction_probability_df,
-    )

src/models/logistic_predict_model.py DELETED

@@ -1,4 +0,0 @@
-from  models.util_predict_model import make_prediction_view
-logistic_predict_model = make_prediction_view(
-    "Logistic", "Logisitic Model")

src/models/logistic_test_model.py DELETED

@@ -1,4 +0,0 @@
-from  models.util_test import make_tests_view
-logistic_test_model = make_tests_view(
-    "Logistic", "Logistic Model")

src/models/logistic_train_model.py DELETED

@@ -1,69 +0,0 @@
-import numpy as np
-from sklearn.linear_model import LogisticRegression
-from  features.build_features import SplitDataset
-import streamlit as st
-import pandas as pd
-from  visualization.graphs_logistic import plot_logistic_coeff_barh
-@st.cache(suppress_st_warning=True)
-def create_clf_logistic_model(X_train, y_train):
-    # Create and fit the logistic regression model
-    return LogisticRegression(solver="lbfgs").fit(X_train, np.ravel(y_train))
-@st.cache(suppress_st_warning=True)
-def create_coeff_dict_logistic_model(
-    logistic_model, training_data
-):
-    return {
-        feat: coef
-        for coef, feat in zip(
-            logistic_model.coef_[0, :], training_data.columns
-        )
-    }
-def coeff_dict_to_sorted_df(coef_dict):
-    coef_dict_sorted = dict(
-        sorted(coef_dict.items(), key=lambda item: item[1], reverse=False)
-    )
-    data_items = coef_dict_sorted.items()
-    data_list = list(data_items)
-    return pd.DataFrame(data_list, columns=["Coefficient", "Value"])
-def interpret_clf_logistic_model(clf_logistic_model, split_dataset):
-    st.metric(
-        label="# of Coefficients in Logistic Regression",
-        value=clf_logistic_model.n_features_in_,
-        delta=None,
-        delta_color="normal",
-    )
-    st.subheader("Logistic Regression Coefficient Values")
-    coef_dict = create_coeff_dict_logistic_model(
-        clf_logistic_model, split_dataset.X_y_train)
-    df = coeff_dict_to_sorted_df(coef_dict)
-    fig = plot_logistic_coeff_barh(df)
-    st.plotly_chart(fig)
-def logistic_train_model(split_dataset: SplitDataset):
-    st.header("Logistic Regression Model")
-    clf_logistic_model = create_clf_logistic_model(
-        split_dataset.X_train, split_dataset.y_train
-    )
-    interpret_clf_logistic_model(clf_logistic_model, split_dataset)
-    return clf_logistic_model
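
Note on `coeff_dict_to_sorted_df` in the removed file: it orders the feature-to-coefficient mapping by coefficient value, ascending, before building the chart data. A minimal sketch of that ordering step without pandas is shown here. The function name `sorted_coefficients`, the feature names, and the values are all illustrative assumptions, not taken from the model in this repository.

```python
from typing import Dict, List, Tuple


def sorted_coefficients(coef_by_feature: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (feature, coefficient) pairs ordered from most negative to most positive."""
    return sorted(coef_by_feature.items(), key=lambda item: item[1])


# Illustrative values only; real coefficients come from the fitted model.
example = {"feature_a": 0.8, "feature_b": -1.2, "feature_c": 0.1}
rows = sorted_coefficients(example)
```

Each resulting pair maps onto one horizontal bar, with the feature name on the y-axis and the coefficient on the x-axis.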

src/models/util_predict_model.py DELETED

@@ -1,87 +0,0 @@
-from typing import Union, cast
-from sklearn.linear_model import LogisticRegression
-import pandas as pd
-from dataclasses import dataclass
-from xgboost import XGBClassifier
-from  features.util_build_features import SplitDataset
-from  models.util_predict_model_threshold import (
-    user_defined_probability_threshold,
-    J_statistic_driven_probability_threshold,
-    tradeoff_threshold,
-    acceptance_rate_driven_threshold,
-    select_probability_threshold,
-    model_probability_values_df)
-import streamlit as st
-def probability_threshold_explainer(model_name):
-    st.write(
-        f"""
-            The {model_name} model (obtained using training data) is applied on testing data to predict the loans probabilities of defaulting.\n
-            Probabilities of defaulting of the loans are compared to a probability threshold.\n
-            A loan is predicted to default if its predicted probability of defaulting is greater than the probability threshold.
-            """
-    )
-@dataclass(frozen=True)
-class Threshold:
-    probability_threshold_selected: float
-    predicted_default_status: pd.Series
-    prediction_probability_df: pd.DataFrame
-def make_prediction_view(
-    model_name_short: str,
-    model_name: str,
-):
-    def view(
-        clf_xgbt_model: Union[XGBClassifier, LogisticRegression],
-        split_dataset: SplitDataset,
-    ) -> Threshold:
-        probability_threshold_explainer(model_name)
-        clf_prediction_prob_df_gbt = model_probability_values_df(
-            clf_xgbt_model,
-            split_dataset.X_test,
-        )
-        (clf_thresh_predicted_default_status_user_gbt,
-         user_threshold
-         ) = user_defined_probability_threshold(
-            model_name_short, clf_xgbt_model, split_dataset)
-        (clf_thresh_predicted_default_status_Jstatistic_gbt,
-         J_statistic_best_threshold) = J_statistic_driven_probability_threshold(
-            clf_prediction_prob_df_gbt, clf_xgbt_model, split_dataset)
-        tradeoff_threshold(clf_prediction_prob_df_gbt, split_dataset)
-        (acc_rate_thresh_gbt,
-         clf_thresh_predicted_default_status_acceptance_gbt) = acceptance_rate_driven_threshold(model_name_short, clf_prediction_prob_df_gbt)
-        (prob_thresh_selected_gbt,
-         predicted_default_status_gbt) = select_probability_threshold(model_name_short,
-                                                                      user_threshold,
-                                                                      clf_thresh_predicted_default_status_user_gbt,
-                                                                      J_statistic_best_threshold,
-                                                                      clf_thresh_predicted_default_status_Jstatistic_gbt,
-                                                                      acc_rate_thresh_gbt,
-                                                                      clf_thresh_predicted_default_status_acceptance_gbt)
-        return Threshold(
-            probability_threshold_selected=cast(
-                float, prob_thresh_selected_gbt
-            ),
-            predicted_default_status=predicted_default_status_gbt,
-            prediction_probability_df=clf_prediction_prob_df_gbt,
-        )
-    return view

src/models/util_predict_model_threshold.py DELETED

@@ -1,310 +0,0 @@
-import streamlit as st
-from sklearn.metrics import classification_report, roc_curve
-import numpy as np
-import plotly.express as px
-import pandas as pd
-from numpy import argmax
-from  visualization.metrics import streamlit_2columns_metrics_df, streamlit_2columns_metrics_pct_df
-from  visualization.graphs_threshold import acceptance_rate_driven_threshold_graph
-def model_probability_values_df(model, X):
-    return pd.DataFrame(model.predict_proba(X)[:, 1], columns=["PROB_DEFAULT"])
-def find_best_threshold_J_statistic(y, clf_prediction_prob_df):
-    fpr, tpr, thresholds = roc_curve(y, clf_prediction_prob_df)
-    # get the best threshold
-    # Youden’s J statistic tpr-fpr
-    # Argmax to get the index in
-    # thresholds
-    return thresholds[argmax(tpr - fpr)]
-# Function that makes dataframe with probability of default, predicted default status based on threshold
-# and actual default status
-def classification_report_per_threshold(
-    threshold_list, threshold_default_status_list, y_test
-):
-    target_names = ["Non-Default", "Default"]
-    classification_report_list = []
-    for threshold_default_status in threshold_default_status_list:
-        thresh_classification_report = classification_report(
-            y_test,
-            threshold_default_status,
-            target_names=target_names,
-            output_dict=True,
-            zero_division=0,
-        )
-        classification_report_list.append(thresh_classification_report)
-    # Return threshold classification report dict
-    return dict(zip(threshold_list, classification_report_list))
-def thresh_classification_report_recall_accuracy(
-    thresh_classification_report_dict,
-):
-    thresh_def_recalls_list = []
-    thresh_nondef_recalls_list = []
-    thresh_accs_list = []
-    for x in [*thresh_classification_report_dict]:
-        thresh_def_recall = thresh_classification_report_dict[x]["Default"][
-            "recall"
-        ]
-        thresh_def_recalls_list.append(thresh_def_recall)
-        thresh_nondef_recall = thresh_classification_report_dict[x][
-            "Non-Default"
-        ]["recall"]
-        thresh_nondef_recalls_list.append(thresh_nondef_recall)
-        thresh_accs = thresh_classification_report_dict[x]["accuracy"]
-        thresh_accs_list.append(thresh_accs)
-    return [
-        thresh_def_recalls_list,
-        thresh_nondef_recalls_list,
-        thresh_accs_list,
-    ]
-def apply_threshold_to_probability_values(probability_values, threshold):
-    return (
-        probability_values["PROB_DEFAULT"]
-        .apply(lambda x: 1 if x > threshold else 0)
-        .rename("PREDICT_DEFAULT_STATUS")
-    )
-@st.cache(suppress_st_warning=True)
-def find_best_threshold_J_statistic(y, clf_prediction_prob_df):
-    fpr, tpr, thresholds = roc_curve(y, clf_prediction_prob_df)
-    # get the best threshold
-    J = tpr - fpr  # Youden’s J statistic
-    ix = argmax(J)
-    return thresholds[ix]
-def default_status_per_threshold(threshold_list, prob_default):
-    threshold_default_status_list = []
-    for threshold in threshold_list:
-        threshold_default_status = prob_default.apply(
-            lambda x: 1 if x > threshold else 0
-        )
-        threshold_default_status_list.append(threshold_default_status)
-    return threshold_default_status_list
-def threshold_and_predictions(clf_xgbt_model, split_dataset, threshold):
-    clf_prediction_prob_df_gbt = model_probability_values_df(
-        clf_xgbt_model,
-        split_dataset.X_test,
-    )
-    clf_thresh_predicted_default_status = (
-        apply_threshold_to_probability_values(
-            clf_prediction_prob_df_gbt,
-            threshold,
-        )
-    )
-    streamlit_2columns_metrics_df(
-        "# of Predicted Defaults",
-        "# of Predicted Non-Default",
-        clf_thresh_predicted_default_status,
-    )
-    streamlit_2columns_metrics_pct_df(
-        "% of Loans Predicted to Default",
-        "% of Loans Predicted not to Default",
-        clf_thresh_predicted_default_status,
-    )
-    return clf_thresh_predicted_default_status
-def user_defined_probability_threshold(model_name_short, clf_xgbt_model, split_dataset):
-    st.subheader("Classification Probability Threshold - User Defined")
-    user_defined_threshold = st.slider(
-        label="Default Probability Threshold:",
-        min_value=0.0,
-        max_value=1.0,
-        value=0.8,
-        key=f"threshold_{model_name_short}_default",
-    )
-    clf_thresh_predicted_default_status = threshold_and_predictions(
-        clf_xgbt_model, split_dataset, user_defined_threshold)
-    return clf_thresh_predicted_default_status, user_defined_threshold
-def J_statistic_driven_probability_threshold(clf_prediction_prob_df_gbt, clf_xgbt_model, split_dataset):
-    st.subheader("J Statistic Driven Classification Probability Threshold")
-    J_statistic_best_threshold = find_best_threshold_J_statistic(
-        split_dataset.y_test, clf_prediction_prob_df_gbt
-    )
-    st.metric(
-        label="Youden's J statistic calculated best threshold",
-        value=J_statistic_best_threshold,
-    )
-    clf_thresh_predicted_default_status = threshold_and_predictions(
-        clf_xgbt_model, split_dataset, J_statistic_best_threshold)
-    return clf_thresh_predicted_default_status, J_statistic_best_threshold
-def create_tradeoff_graph(df):
-    fig2 = px.line(
-        data_frame=df,
-        y=["Default Recall", "Non Default Recall", "Accuracy"],
-        x="Threshold",
-    )
-    fig2.update_layout(
-        title="Recall and Accuracy score Trade-off with Probability Threshold",
-        xaxis_title="Probability Threshold",
-        yaxis_title="Score",
-    )
-    fig2.update_yaxes(range=[0.0, 1.0])
-    st.plotly_chart(fig2)
-def tradeoff_threshold(clf_prediction_prob_df_gbt, split_dataset):
-    st.subheader(
-        "Recall and Accuracy Tradeoff with given Probability Threshold"
-    )
-    threshold_list = np.arange(
-        0, 1, 0.025).round(decimals=3).tolist()
-    threshold_default_status_list = default_status_per_threshold(
-        threshold_list, clf_prediction_prob_df_gbt["PROB_DEFAULT"]
-    )
-    thresh_classification_report_dict = (
-        classification_report_per_threshold(
-            threshold_list,
-            threshold_default_status_list,
-            split_dataset.y_test,
-        )
-    )
-    (
-        thresh_def_recalls_list,
-        thresh_nondef_recalls_list,
-        thresh_accs_list,
-    ) = thresh_classification_report_recall_accuracy(
-        thresh_classification_report_dict
-    )
-    namelist = [
-        "Default Recall",
-        "Non Default Recall",
-        "Accuracy",
-        "Threshold",
-    ]
-    df = pd.DataFrame(
-        [
-            thresh_def_recalls_list,
-            thresh_nondef_recalls_list,
-            thresh_accs_list,
-            threshold_list,
-        ],
-        index=namelist,
-    )
-    df = df.T
-    create_tradeoff_graph(df)
-def select_probability_threshold(model_name_short,
-                                 user_defined_threshold,
-                                 clf_thresh_predicted_default_status_user_gbt,
-                                 J_statistic_best_threshold,
-                                 clf_thresh_predicted_default_status_Jstatistic_gbt,
-                                 acc_rate_thresh_gbt,
-                                 clf_thresh_predicted_default_status_acceptance_gbt):
-    st.subheader("Selected Probability Threshold")
-    options = [
-        "User Defined",
-        "J Statistic Driven",
-        "Acceptance Rate Driven",
-    ]
-    prob_thresh_option = st.radio(
-        label="Selected Probability Threshold",
-        options=options,
-        key=f"{model_name_short}_radio_thresh",
-    )
-    if prob_thresh_option == "User Defined":
-        prob_thresh_selected_gbt = user_defined_threshold
-        predicted_default_status_gbt = (
-            clf_thresh_predicted_default_status_user_gbt
-        )
-    elif prob_thresh_option == "J Statistic Driven":
-        prob_thresh_selected_gbt = J_statistic_best_threshold
-        predicted_default_status_gbt = (
-            clf_thresh_predicted_default_status_Jstatistic_gbt
-        )
-    else:
-        prob_thresh_selected_gbt = acc_rate_thresh_gbt
-        predicted_default_status_gbt = (
-            clf_thresh_predicted_default_status_acceptance_gbt
-        )
-    st.write(
-        f"Selected probability threshold is {prob_thresh_selected_gbt}"
-    )
-    return prob_thresh_selected_gbt, predicted_default_status_gbt
-def acceptance_rate_driven_threshold(model_name_short, clf_prediction_prob_df_gbt):
-    st.subheader("Acceptance Rate Driven Probability Threshold")
-    # Steps
-    # Set acceptance rate
-    # Get default status per threshold
-    # Get classification report per threshold
-    # Get recall, nondef recall, and accuracy per threshold
-    acceptance_rate = (
-        st.slider(
-            label="% of loans accepted (acceptance rate):",
-            min_value=0,
-            max_value=100,
-            value=85,
-            key=f"acceptance_rate_{model_name_short}",
-            format="%f%%",
-        )
-        / 100
-    )
-    acc_rate_thresh_gbt = np.quantile(
-        clf_prediction_prob_df_gbt["PROB_DEFAULT"], acceptance_rate
-    )
-    st.write(
-        f"An acceptance rate of {acceptance_rate} results in probability threshold of {acc_rate_thresh_gbt}"
-    )
-    acceptance_rate_driven_threshold_graph(
-        clf_prediction_prob_df_gbt, acc_rate_thresh_gbt)
-    clf_thresh_predicted_default_status_acceptance_gbt = apply_threshold_to_probability_values(
-        clf_prediction_prob_df_gbt,
-        acc_rate_thresh_gbt,
-    )
-    return acc_rate_thresh_gbt, clf_thresh_predicted_default_status_acceptance_gbt
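
Note on `find_best_threshold_J_statistic` in the removed file: Youden's J statistic is the true positive rate minus the false positive rate, and the chosen threshold is the one that maximises it. The sketch below illustrates the idea in pure Python over a fixed grid of candidate thresholds. This is an assumption made for illustration: the removed code instead takes its candidates from scikit-learn's `roc_curve`. The strict `>` comparison mirrors `apply_threshold_to_probability_values`; the function names are hypothetical.

```python
from typing import Sequence, Tuple


def rates_at_threshold(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Return (TPR, FPR) when score > threshold is predicted as default (1)."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score > threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def best_threshold_by_j(
    y_true: Sequence[int], scores: Sequence[float], candidates: Sequence[float]
) -> float:
    """Return the first candidate threshold that maximises J = TPR - FPR."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    best_threshold, best_j = candidates[0], float("-inf")
    for threshold in candidates:
        tpr, fpr = rates_at_threshold(y_true, scores, threshold)
        j_statistic = tpr - fpr
        if j_statistic > best_j:  # ties keep the earliest candidate
            best_threshold, best_j = threshold, j_statistic
    return best_threshold
```

Because ties keep the earliest candidate, the result depends on the order and spacing of the grid, so a finer grid may be worth using when the score distribution is dense.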

src/models/xgboost_model.py DELETED

@@ -1,33 +0,0 @@
-from  features.build_features import SplitDataset
-from  models.xgboost_train_model import xgboost_train_model
-from  models.xgboost_predict_model import xgboost_predit_model
-from  models.xgboost_test_model import xgboost_test_model
-from  models.util_model_class import ModelClass
-def xgboost_class(split_dataset: SplitDataset, currency: str):
-    # Train Model
-    clf_xgbt_model = xgboost_train_model(split_dataset)
-    # Predit using Trained Model
-    clf_xgbt_predictions = xgboost_predit_model(
-        clf_xgbt_model, split_dataset)
-    # Test and Evaluate Model
-    df_trueStatus_probabilityDefault_threshStatus_loanAmount_xgbt = xgboost_test_model(
-        clf_xgbt_model,
-        split_dataset,
-        currency,
-        clf_xgbt_predictions.probability_threshold_selected,
-        clf_xgbt_predictions.predicted_default_status)
-    return ModelClass(
-        model=clf_xgbt_model,
-        trueStatus_probabilityDefault_threshStatus_loanAmount_df=df_trueStatus_probabilityDefault_threshStatus_loanAmount_xgbt,
-        probability_threshold_selected=clf_xgbt_predictions.probability_threshold_selected,
-        predicted_default_status=clf_xgbt_predictions.predicted_default_status,
-        prediction_probability_df=clf_xgbt_predictions.prediction_probability_df,
-    )

src/models/xgboost_predict_model.py DELETED

@@ -1,4 +0,0 @@
-from  models.util_predict_model import make_prediction_view
-xgboost_predit_model = make_prediction_view(
-    "XGBoost", "Gradient Boosted Tree with XGBoost")

src/models/xgboost_test_model.py DELETED

@@ -1,4 +0,0 @@
-from  models.util_test import make_tests_view
-xgboost_test_model = make_tests_view(
-    "XGBoost", "Gradient Boosted Tree with XGBoost")

src/models/xgboost_train_model.py DELETED

@@ -1,68 +0,0 @@
-import pickle
-import numpy as np
-import xgboost as xgb
-from  features.build_features import SplitDataset
-import streamlit as st
-from  visualization.graphs_decision_tree import(plot_importance_gbt,
-                                                   plot_tree_gbt)
-from  visualization.graphs_settings import streamlit_chart_setting_height_width
-from  visualization.graphs_download import (download_importance_gbt,
-                                               download_tree_gbt)
-@ st.cache(suppress_st_warning=True, hash_funcs={
-    xgb.XGBClassifier: pickle.dumps
-})
-def create_clf_xgbt_model(X_train, y_train):
-    # Using hyperparameters learning_rate and max_depth
-    return xgb.XGBClassifier(
-        learning_rate=0.1,
-        max_depth=7,
-        use_label_encoder=False,
-        eval_metric="logloss",
-    ).fit(X_train, np.ravel(y_train), eval_metric="logloss")
-def interpret_clf_xgbt_model(clf_xgbt_model):
-    st.subheader("XGBoost Decision Tree Feature Importance")
-    (barxsize, barysize,) = streamlit_chart_setting_height_width(
-        "Chart Settings", 10, 15, "barxsize", "barysize"
-    )
-    fig1 = plot_importance_gbt(clf_xgbt_model, barxsize, barysize)
-    st.pyplot(fig1)
-    download_importance_gbt(fig1, barxsize, barysize)
-    st.subheader("XGBoost Decision Tree Structure")
-    (treexsize, treeysize,) = streamlit_chart_setting_height_width(
-        "Chart Settings", 5, 5, "treexsize", "treeysize"
-    )
-    fig2 = plot_tree_gbt(treexsize, treeysize, clf_xgbt_model)
-    st.pyplot(fig2)
-    download_tree_gbt(treexsize, treeysize)
-    st.markdown(
-        "Note: The downloaded XGBoost Decision Tree plot chart in png has higher resolution than that displayed here."
-    )
-def xgboost_train_model(split_dataset: SplitDataset):
-    st.header("XGBoost Decision Trees")
-    clf_xgbt_model = create_clf_xgbt_model(
-        split_dataset.X_train, split_dataset.y_train
-    )
-    interpret_clf_xgbt_model(clf_xgbt_model)
-    return clf_xgbt_model

src/visualization/__init__.py DELETED

File without changes

src/visualization/graphs_decision_tree.py DELETED

@@ -1,23 +0,0 @@
-import xgboost as xgb
-import streamlit as st
-import matplotlib.pyplot as plt
-from xgboost import plot_tree
-def plot_importance_gbt(clf_xgbt_model, barxsize, barysize):
-    axobject1 = xgb.plot_importance(clf_xgbt_model, importance_type="weight")
-    fig1 = axobject1.figure
-    st.write("Feature Importance Plot (Gradient Boosted Tree)")
-    fig1.set_size_inches(barxsize, barysize)
-    return fig1
-def plot_tree_gbt(treexsize, treeysize, clf_xgbt_model):
-    plot_tree(clf_xgbt_model)
-    fig2 = plt.gcf()
-    fig2.set_size_inches(treexsize, treeysize)
-    return fig2

src/visualization/graphs_download.py DELETED

@@ -1,17 +0,0 @@
-import streamlit as st
-import matplotlib.pyplot as plt
-def download_importance_gbt(fig1, barxsize, barysize):
-    if st.button(
-        "Download Feature Importance Plot as png (Gradient Boosted Tree)"
-    ):
-        dpisize = max(barxsize, barysize)
-        plt.savefig("bar.png", dpi=dpisize * 96, bbox_inches="tight")
-        fig1.set_size_inches(barxsize, barysize)
-def download_tree_gbt(treexsize, treeysize):
-    if st.button("Download XGBoost Decision Tree Plot as png (Gradient Boosted Tree)"):
-        dpisize = max(treexsize, treeysize)
-        plt.savefig("tree.png", dpi=dpisize * 96, bbox_inches="tight")

src/visualization/graphs_logistic.py DELETED

@@ -1,12 +0,0 @@
-import plotly.express as px
-def plot_logistic_coeff_barh(df):
-    fig = px.bar(data_frame=df, x="Value",
-                 y="Coefficient", orientation="h")
-    fig.update_layout(
-        title="Logistic Regression Coefficients",
-        xaxis_title="Value",
-        yaxis_title="Coefficient",)
-    return fig

src/visualization/graphs_settings.py DELETED

@@ -1,28 +0,0 @@
-import streamlit as st
-def streamlit_chart_setting_height_width(
-    title: str,
-    default_widthvalue: int,
-    default_heightvalue: int,
-    widthkey: str,
-    heightkey: str,
-):
-    with st.expander(title):
-        lbarx_col, lbary_col = st.columns(2)
-        with lbarx_col:
-            width_size = st.number_input(
-                label="Width in inches:",
-                value=default_widthvalue,
-                key=widthkey,
-            )
-        with lbary_col:
-            height_size = st.number_input(
-                label="Height in inches:",
-                value=default_heightvalue,
-                key=heightkey,
-            )
-    return width_size, height_size

src/visualization/graphs_test.py DELETED

@@ -1,78 +0,0 @@
-from matplotlib import pyplot as plt
-from sklearn.metrics import roc_curve
-from typing import OrderedDict
-from  models.util_model_class import ModelClass
-from sklearn.calibration import calibration_curve
-def cross_validation_graph(cv, eval_metric, trees):
-    # Plot the test AUC scores for each iteration
-    fig = plt.figure()
-    plt.plot(cv[cv.columns[2]])
-    plt.title(
-        "Test {eval_metric} Score Over {it_numbr} Iterations".format(
-            eval_metric=eval_metric, it_numbr=trees
-        )
-    )
-    plt.xlabel("Iteration Number")
-    plt.ylabel("Test {eval_metric} Score".format(eval_metric=eval_metric))
-    return fig
-def roc_auc_compare_n_models(y, model_views: OrderedDict[str, ModelClass]):
-    colors = ["blue", "green"]
-    fig = plt.figure()
-    for color_idx, (model_name, model_view) in enumerate(model_views.items()):
-        fpr, tpr, _thresholds = roc_curve(
-            y, model_view.prediction_probability_df
-        )
-        plt.plot(fpr, tpr, color=colors[color_idx], label=f"{model_name}")
-    plt.plot([0, 1], [0, 1], linestyle="--", label="Random Prediction")
-    model_names = list(model_views.keys())
-    if not model_names:
-        model_name_str = "None"
-    elif len(model_names) == 1:
-        model_name_str = model_names[0]
-    else:
-        model_name_str = " and ".join(
-            [", ".join(model_names[:-1]), model_names[-1]]
-        )
-    plt.title(f"ROC Chart for {model_name_str} on the Probability of Default")
-    plt.xlabel("False Positive Rate (FP Rate)")
-    plt.ylabel("True Positive Rate (TP Rate)")
-    plt.legend()
-    plt.grid(False)
-    plt.xlim(0, 1)
-    plt.ylim(0, 1)
-    return fig
-def calibration_curve_report_commented_n(
-    y, model_views: OrderedDict[str, ModelClass], bins: int
-):
-    fig = plt.figure()
-    for model_name, model_view in model_views.items():
-        frac_of_pos, mean_pred_val = calibration_curve(
-            y,
-            model_view.prediction_probability_df,
-            n_bins=bins,
-            normalize=True,
-        )
-        plt.plot(mean_pred_val, frac_of_pos, "s-", label=f"{model_name}")
-    # Create the calibration curve plot with the guideline
-    plt.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
-    plt.ylabel("Fraction of positives")
-    plt.xlabel("Average Predicted Probability")
-    plt.title("Calibration Curve")
-    plt.legend()
-    plt.grid(False)
-    plt.xlim(0, 1)
-    plt.ylim(0, 1)
-    return fig
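
For reference, the binning behind a calibration curve can be sketched without scikit-learn. This is a simplified standard-library illustration that assumes equal-width bins on [0, 1]. `calibration_points` is a hypothetical helper, not part of this repository, and its handling of bin edges may differ from `sklearn.calibration.calibration_curve`.

```python
def calibration_points(y_true, y_prob, n_bins):
    """Return (fraction_of_positives, mean_predicted_probability) per non-empty bin.

    Hypothetical helper for illustration only: equal-width bins on [0, 1].
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    prob_sums = [0.0] * n_bins
    positive_counts = [0] * n_bins
    counts = [0] * n_bins
    for label, prob in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        prob_sums[idx] += prob
        positive_counts[idx] += label
        counts[idx] += 1
    frac_of_pos = []
    mean_pred_val = []
    for idx in range(n_bins):
        if counts[idx] == 0:
            continue  # empty bins are skipped, so the lists can be shorter than n_bins
        frac_of_pos.append(positive_counts[idx] / counts[idx])
        mean_pred_val.append(prob_sums[idx] / counts[idx])
    return frac_of_pos, mean_pred_val
```

Plotting `mean_pred_val` against `frac_of_pos` gives the curve. A well-calibrated model stays close to the diagonal drawn by the "Perfectly calibrated" guideline above.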

src/visualization/graphs_threshold.py DELETED

@@ -1,80 +0,0 @@
-import plotly.express as px
-import streamlit as st
-import matplotlib.pyplot as plt
-import numpy as np
-def acceptance_rate_driven_threshold_graph(clf_prediction_prob_df_gbt, acc_rate_thresh_gbt):
-    figa = px.histogram(clf_prediction_prob_df_gbt["PROB_DEFAULT"])
-    figa.update_layout(
-        title="Acceptance Rate Threshold vs. Loans Accepted",
-        xaxis_title="Acceptance Rate Threshold",
-        yaxis_title="Loans Accepted",
-    )
-    figa.update_traces(marker_line_width=1, marker_line_color="white")
-    figa.add_vline(
-        x=acc_rate_thresh_gbt,
-        line_width=3,
-        line_dash="solid",
-        line_color="red",
-    )
-    st.plotly_chart(figa)
-def recall_accuracy_threshold_tradeoff_fig(
-    widthsize,
-    heightsize,
-    threshold_list,
-    thresh_def_recalls_list,
-    thresh_nondef_recalls_list,
-    thresh_accs_list,
-):
-    fig = plt.figure(figsize=(widthsize, heightsize))
-    plt.plot(threshold_list, thresh_def_recalls_list, label="Default Recall")
-    plt.plot(
-        threshold_list, thresh_nondef_recalls_list, label="Non-Default Recall"
-    )
-    plt.plot(threshold_list, thresh_accs_list, label="Model Accuracy")
-    plt.xlabel("Probability Threshold")
-    plt.ylabel("Score")
-    plt.xlim(0, 1)
-    plt.ylim(0, 1)
-    plt.legend()
-    plt.title("Recall and Accuracy Score Tradeoff with Probability Threshold")
-    plt.grid(False)
-    return fig
-def acceptance_rate_threshold_fig(probability_default, acceptancerate, bins):
-    # Probability distribution
-    probability_stat_distribution = probability_default.describe()
-    # Acceptance rate threshold
-    acc_rate_thresh = np.quantile(probability_default, acceptancerate)
-    fig = plt.figure()
-    plt.hist(
-        probability_default,
-        color="blue",
-        bins=bins,
-        histtype="bar",
-        ec="white",
-    )
-    # Add a reference line to the plot for the threshold
-    plt.axvline(x=acc_rate_thresh, color="red")
-    plt.title("Acceptance Rate Thershold")
-    return (
-        fig,
-        probability_stat_distribution,
-        acc_rate_thresh,
-    )
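
The acceptance-rate threshold above is the quantile of the predicted default probabilities at the chosen acceptance rate. A minimal standard-library sketch of that idea follows. `linear_quantile` and `accepted_loans` are hypothetical names, and the linear interpolation is written out by hand on the assumption that it matches the default behavior of `np.quantile`.

```python
def linear_quantile(values, q):
    """Quantile with linear interpolation between neighboring order statistics."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def accepted_loans(prob_default, acceptance_rate):
    """Return (threshold, accepted probabilities) for a given acceptance rate.

    A loan is accepted when its probability of default does not exceed the
    threshold, mirroring the `x > threshold` default rule used in the diff.
    """
    threshold = linear_quantile(prob_default, acceptance_rate)
    accepted = [p for p in prob_default if p <= threshold]
    return threshold, accepted
```

With five loans and an acceptance rate of 0.85, the threshold falls between the fourth and fifth lowest probabilities, so four loans are accepted.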

src/visualization/metrics.py DELETED

@@ -1,132 +0,0 @@
-import pandas as pd
-import streamlit as st
-def streamlit_2columns_metrics_pct_df(
-    column1name_label: str,
-    column2name_label: str,
-    df: pd.DataFrame,
-):
-    (
-        column1name,
-        column2name,
-    ) = st.columns(2)
-    with column1name:
-        st.metric(
-            label=column1name_label,
-            value="{:.0%}".format(df.value_counts().get(1) / df.shape[0]),
-            delta=None,
-            delta_color="normal",
-        )
-    with column2name:
-        st.metric(
-            label=column2name_label,
-            value="{:.0%}".format(df.value_counts().get(0) / df.shape[0]),
-            delta=None,
-            delta_color="normal",
-        )
-def streamlit_2columns_metrics_df(
-    column1name_label: str,
-    column2name_label: str,
-    df: pd.DataFrame,
-):
-    (
-        column1name,
-        column2name,
-    ) = st.columns(2)
-    with column1name:
-        st.metric(
-            label=column1name_label,
-            value=df.value_counts().get(1),
-            delta=None,
-            delta_color="normal",
-        )
-    with column2name:
-        st.metric(
-            label=column2name_label,
-            value=df.value_counts().get(0),
-            delta=None,
-            delta_color="normal",
-        )
-def streamlit_2columns_metrics_df_shape(df: pd.DataFrame):
-    (
-        column1name,
-        column2name,
-    ) = st.columns(2)
-    with column1name:
-        st.metric(
-            label="Rows",
-            value=df.shape[0],
-            delta=None,
-            delta_color="normal",
-        )
-    with column2name:
-        st.metric(
-            label="Columns",
-            value=df.shape[1],
-            delta=None,
-            delta_color="normal",
-        )
-def streamlit_2columns_metrics_pct_series(
-    column1name_label: str,
-    column2name_label: str,
-    series: pd.Series,
-):
-    (
-        column1name,
-        column2name,
-    ) = st.columns(2)
-    with column1name:
-        st.metric(
-            label=column1name_label,
-            value="{:.0%}".format(series.get(1) / series.sum()),
-            delta=None,
-            delta_color="normal",
-        )
-    with column2name:
-        st.metric(
-            label=column2name_label,
-            value="{:.0%}".format(series.get(0) / series.sum()),
-            delta=None,
-            delta_color="normal",
-        )
-def streamlit_2columns_metrics_series(
-    column1name_label: str,
-    column2name_label: str,
-    series: pd.Series,
-):
-    (
-        column1name,
-        column2name,
-    ) = st.columns(2)
-    with column1name:
-        st.metric(
-            label=column1name_label,
-            value=series.get(1),
-            delta=None,
-            delta_color="normal",
-        )
-    with column2name:
-        st.metric(
-            label=column2name_label,
-            value=series.get(0),
-            delta=None,
-            delta_color="normal",
-        )
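
One thing to watch in the helpers above: `.get(1)` and `.get(0)` return `None` when a class is absent (for example, when a threshold of 1.0 predicts no defaults), and the percentage variants then fail on the division. A hedged standard-library sketch of a share calculation that defaults to zero follows. `class_share` and `format_pct` are hypothetical names, and `collections.Counter` stands in for `value_counts()`.

```python
from collections import Counter


def class_share(labels, target):
    """Share of `labels` equal to `target`; 0.0 when the class is absent or the input is empty."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Counter.get with a default avoids the None that a missing key would produce.
    return counts.get(target, 0) / total


def format_pct(share):
    """Format a share the same way the metrics above do."""
    return "{:.0%}".format(share)
```

The same effect is available in pandas by passing a default, as in `series.get(1, 0)`.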

{src/features → views}/__init__.py RENAMED

File without changes

views/decision_tree.py ADDED

	@@ -0,0 +1,70 @@

+from common.data import SplitDataset
+import streamlit as st
+from common.util import (
+    test_variables_gbt,
+)
+from common.views import (
+    streamlit_chart_setting_height_width,
+    plot_importance_gbt,
+    plot_tree_gbt,
+    download_importance_gbt,
+    download_tree_gbt,
+)
+from views.typing import ModelView
+from views.threshold import decision_tree_threshold_view
+from views.evaluation import decision_tree_evaluation_view
+def decisiontree_view(split_dataset: SplitDataset, currency: str):
+    st.header("Decision Trees")
+    clf_gbt_model = test_variables_gbt(
+        split_dataset.X_train, split_dataset.y_train
+    )
+    st.subheader("Decision Tree Feature Importance")
+    (barxsize, barysize,) = streamlit_chart_setting_height_width(
+        "Chart Settings", 10, 15, "barxsize", "barysize"
+    )
+    fig1 = plot_importance_gbt(clf_gbt_model, barxsize, barysize)
+    st.pyplot(fig1)
+    download_importance_gbt(fig1, barxsize, barysize)
+    st.subheader("Decision Tree Structure")
+    (treexsize, treeysize,) = streamlit_chart_setting_height_width(
+        "Chart Settings", 15, 10, "treexsize", "treeysize"
+    )
+    fig2 = plot_tree_gbt(treexsize, treeysize, clf_gbt_model)
+    st.pyplot(fig2)
+    download_tree_gbt(treexsize, treeysize)
+    st.markdown(
+        "Note: The downloaded PNG of the decision tree plot has a higher resolution than the chart displayed here."
+    )
+    threshold = decision_tree_threshold_view(clf_gbt_model, split_dataset)
+    df_trueStatus_probabilityDefault_threshStatus_loanAmount = (
+        decision_tree_evaluation_view(
+            clf_gbt_model,
+            split_dataset,
+            currency,
+            threshold.probability_threshold_selected,
+            threshold.predicted_default_status,
+        )
+    )
+    return ModelView(
+        model=clf_gbt_model,
+        trueStatus_probabilityDefault_threshStatus_loanAmount_df=df_trueStatus_probabilityDefault_threshStatus_loanAmount,
+        probability_threshold_selected=threshold.probability_threshold_selected,
+        predicted_default_status=threshold.predicted_default_status,
+        prediction_probability_df=threshold.prediction_probability_df,
+    )

src/models/util_test.py → views/evaluation.py RENAMED

@@ -1,6 +1,5 @@
 from typing import Union
 import pandas as pd
-from sklearn.model_selection import StratifiedKFold, cross_val_score
 import streamlit as st
 import numpy as np
 from sklearn.metrics import (
@@ -8,25 +7,24 @@ from sklearn.metrics import (
     confusion_matrix,
 )
 from sklearn.linear_model import LogisticRegression
-import xgboost as xgb
 from xgboost.sklearn import XGBClassifier
-from  features.util_build_features import SplitDataset
-"""from  models.model_utils import (
+from common.data import SplitDataset
+from common.util import (
     create_cross_validation_df,
     cross_validation_scores,
     get_df_trueStatus_probabilityDefault_threshStatus_loanAmount,
-)"""
-from  visualization.graphs_test import (
+)
+from common.views import (
     cross_validation_graph,
 )
-def make_tests_view(
+def make_evaluation_view(
     model_name_short: str,
     model_name_generic: str,
 ):
     def view(
-        clf_xgbt_model: Union[XGBClassifier, LogisticRegression],
+        clf_gbt_model: Union[XGBClassifier, LogisticRegression],
         split_dataset: SplitDataset,
         currency: str,
         prob_thresh_selected,
@@ -42,7 +40,7 @@ def make_tests_view(
             train on each fold suggests performance will be stable."
         )
-        st.write(f'{model_name_short} cross validation test:')
+        st.write(f"XGBoost cross validation test:")
         stcol_seed, stcol_eval_metric = st.columns(2)
@@ -172,7 +170,7 @@ def make_tests_view(
             )
         cv_scores = cross_validation_scores(
-            clf_xgbt_model,
+            clf_gbt_model,
             split_dataset.X_test,
             split_dataset.y_test,
             nfolds_score,
@@ -327,7 +325,7 @@ def make_tests_view(
         df_trueStatus_probabilityDefault_threshStatus_loanAmount = (
             get_df_trueStatus_probabilityDefault_threshStatus_loanAmount(
-                clf_xgbt_model,
+                clf_gbt_model,
                 split_dataset.X_test,
                 split_dataset.y_test,
                 prob_thresh_selected,
@@ -408,161 +406,5 @@ def make_tests_view(
     return view
-def cross_validation_scores(model, X, y, nfold, score, seed):
-    # return cv scores of metric
-    return cross_val_score(
-        model,
-        np.ascontiguousarray(X),
-        np.ravel(np.ascontiguousarray(y)),
-        cv=StratifiedKFold(n_splits=nfold, shuffle=True, random_state=seed),
-        scoring=score,
-    )
-def create_cross_validation_df(
-    X, y, eval_metric, seed, trees, n_folds, early_stopping_rounds
-):
-    # Test data x and y
-    DTrain = xgb.DMatrix(X, label=y)
-    # auc or logloss
-    params = {
-        "eval_metric": eval_metric,
-        "objective": "binary:logistic",  # logistic say 0 or 1 for loan status
-        "seed": seed,
-    }
-    # Create the data frame of cross validations
-    cv_df = xgb.cv(
-        params,
-        DTrain,
-        num_boost_round=trees,
-        nfold=n_folds,
-        early_stopping_rounds=early_stopping_rounds,
-        shuffle=True,
-    )
-    return [DTrain, cv_df]
-def create_accept_rate_list(start, end, samples):
-    return np.linspace(start, end, samples, endpoint=True)
-def create_strategyTable_df(
-    start, end, samples, actual_probability_predicted_acc_rate, true, currency
-):
-    accept_rates = create_accept_rate_list(start, end, samples)
-    thresholds_strat = []
-    bad_rates_start = []
-    Avg_Loan_Amnt = actual_probability_predicted_acc_rate[true].mean()
-    num_accepted_loans_start = []
-    for rate in accept_rates:
-        # Calculate the threshold for the acceptance rate
-        thresh = np.quantile(
-            actual_probability_predicted_acc_rate["PROB_DEFAULT"], rate
-        ).round(3)
-        # Add the threshold value to the list of thresholds
-        thresholds_strat.append(
-            np.quantile(
-                actual_probability_predicted_acc_rate["PROB_DEFAULT"], rate
-            ).round(3)
-        )
-        # Reassign the loan_status value using the threshold
-        actual_probability_predicted_acc_rate[
-            "PREDICT_DEFAULT_STATUS"
-        ] = actual_probability_predicted_acc_rate["PROB_DEFAULT"].apply(
-            lambda x: 1 if x > thresh else 0
-        )
-        # Create a set of accepted loans using this acceptance rate
-        accepted_loans = actual_probability_predicted_acc_rate[
-            actual_probability_predicted_acc_rate["PREDICT_DEFAULT_STATUS"]
-            == 0
-        ]
-        # Calculate and append the bad rate using the acceptance rate
-        bad_rates_start.append(
-            np.sum((accepted_loans[true]) / len(accepted_loans[true])).round(3)
-        )
-        # Accepted loans
-        num_accepted_loans_start.append(len(accepted_loans))
-    # Calculate estimated value
-    money_accepted_loans = [
-        accepted_loans * Avg_Loan_Amnt
-        for accepted_loans in num_accepted_loans_start
-    ]
-    money_bad_accepted_loans = [
-        2 * money_accepted_loan * bad_rate
-        for money_accepted_loan, bad_rate in zip(
-            money_accepted_loans, bad_rates_start
-        )
-    ]
-    zip_object = zip(money_accepted_loans, money_bad_accepted_loans)
-    estimated_value = [
-        money_accepted_loan - money_bad_accepted_loan
-        for money_accepted_loan, money_bad_accepted_loan in zip_object
-    ]
-    accept_rates = ["{:.2f}".format(elem) for elem in accept_rates]
-    thresholds_strat = ["{:.2f}".format(elem) for elem in thresholds_strat]
-    bad_rates_start = ["{:.2f}".format(elem) for elem in bad_rates_start]
-    estimated_value = ["{:.2f}".format(elem) for elem in estimated_value]
-    return (
-        pd.DataFrame(
-            zip(
-                accept_rates,
-                thresholds_strat,
-                bad_rates_start,
-                num_accepted_loans_start,
-                estimated_value,
-            ),
-            columns=[
-                "Acceptance Rate",
-                "Threshold",
-                "Bad Rate",
-                "Num Accepted Loans",
-                f"Estimated Value ({currency})",
-            ],
-        )
-        .sort_values(by="Acceptance Rate", axis=0, ascending=False)
-        .reset_index(drop=True)
-    )
-def get_df_trueStatus_probabilityDefault_threshStatus_loanAmount(
-    model, X, y, threshold, loan_amount_col_name
-):
-    true_status = y.to_frame()
-    loan_amount = X[loan_amount_col_name]
-    clf_prediction_prob = model.predict_proba(np.ascontiguousarray(X))
-    clf_prediction_prob_df = pd.DataFrame(
-        clf_prediction_prob[:, 1], columns=["PROB_DEFAULT"]
-    )
-    clf_thresh_predicted_default_status = (
-        clf_prediction_prob_df["PROB_DEFAULT"]
-        .apply(lambda x: 1 if x > threshold else 0)
-        .rename("PREDICT_DEFAULT_STATUS")
-    )
-    return pd.concat(
-        [
-            true_status.reset_index(drop=True),
-            clf_prediction_prob_df.reset_index(drop=True),
-            clf_thresh_predicted_default_status.reset_index(drop=True),
-            loan_amount.reset_index(drop=True),
-        ],
-        axis=1,
-    )
+decision_tree_evaluation_view = make_evaluation_view("gbt", "Decision Tree")
+logistic_evaluation_view = make_evaluation_view("lg", "Logistic Regression")
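
The value estimate in the removed `create_strategyTable_df` (now imported from `common.util`) reduces to simple arithmetic: the accepted amount minus twice the accepted amount times the bad rate. A minimal sketch under that reading follows. `bad_rate` and `estimated_value` are hypothetical helper names, and the factor of 2 is copied from the diff rather than derived here.

```python
def bad_rate(accepted_true_statuses):
    """Share of accepted loans that actually defaulted (true status 1)."""
    if not accepted_true_statuses:
        return 0.0
    return sum(accepted_true_statuses) / len(accepted_true_statuses)


def estimated_value(num_accepted, avg_loan_amount, bad_rate_value):
    """Accepted amount minus the penalized bad amount, following the diff's formula."""
    accepted_amount = num_accepted * avg_loan_amount
    # The multiplier of 2 mirrors `2 * money_accepted_loan * bad_rate` in the diff.
    bad_amount = 2 * accepted_amount * bad_rate_value
    return accepted_amount - bad_amount
```

For example, four accepted loans averaging 1000.0 with one default give a bad rate of 0.25 and an estimated value of 2000.0. Under this formula the estimate turns negative once the bad rate exceeds 0.5.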

views/logistic.py ADDED

	@@ -0,0 +1,119 @@

+from common.data import SplitDataset
+import streamlit as st
+import pandas as pd
+import plotly.express as px
+from views.threshold import logistic_threshold_view
+from views.evaluation import logistic_evaluation_view
+from common.util import (
+    test_variables_logistic,
+    print_coeff_logistic,
+    model_probability_values_df,
+    apply_threshold_to_probability_values,
+)
+from common.views import (
+    streamlit_2columns_metrics_df,
+    streamlit_2columns_metrics_pct_df,
+)
+from views.typing import ModelView
+def logistic_view(split_dataset: SplitDataset, currency: str) -> ModelView:
+    # ### Test and create variables logically
+    st.header("Logistic Regression")
+    clf_logistic_model = test_variables_logistic(
+        split_dataset.X_train, split_dataset.y_train
+    )
+    st.metric(
+        label="# of Coefficients in Logistic Regression",
+        value=clf_logistic_model.n_features_in_,
+        delta=None,
+        delta_color="normal",
+    )
+    coef_dict = print_coeff_logistic(clf_logistic_model, split_dataset)
+    st.subheader("Logistic Regression Coefficient Values")
+    coef_dict_sorted = dict(
+        sorted(coef_dict.items(), key=lambda item: item[1], reverse=False)
+    )
+    data_items = coef_dict_sorted.items()
+    data_list = list(data_items)
+    df = pd.DataFrame(data_list, columns=["Coefficient", "Value"])
+    fig1 = px.bar(data_frame=df, x="Value", y="Coefficient", orientation="h")
+    fig1.update_layout(
+        title="Logistic Regression Coefficients",
+        xaxis_title="Value",
+        yaxis_title="Coefficient",
+    )
+    st.plotly_chart(fig1)
+    st.subheader("Classification Probability Threshold")
+    st.write(
+        """
+        The logistic regression model (obtained using training data) is applied to the testing data to predict each loan's probability of defaulting.\n
+        Each loan's probability of defaulting is compared to a probability threshold.\n
+        A loan is predicted to default if its predicted probability of defaulting is greater than the probability threshold.
+        """
+    )
+    threshold = st.slider(
+        label="Default Probability Threshold:",
+        min_value=0.0,
+        max_value=1.0,
+        value=0.7,
+        key="key_threshold",
+    )
+    clf_prediction_prob_df_log = model_probability_values_df(
+        clf_logistic_model,
+        split_dataset.X_test,
+    )
+    clf_thresh_predicted_default_status_user = (
+        apply_threshold_to_probability_values(
+            clf_prediction_prob_df_log,
+            threshold,
+        )
+    )
+    streamlit_2columns_metrics_df(
+        "# of Predicted Defaults",
+        "# of Predicted Non-Default",
+        clf_thresh_predicted_default_status_user,
+    )
+    streamlit_2columns_metrics_pct_df(
+        "% of Loans Predicted to Default",
+        "% of Loans Predicted not to Default",
+        clf_thresh_predicted_default_status_user,
+    )
+    threshold = logistic_threshold_view(clf_logistic_model, split_dataset)
+    df_trueStatus_probabilityDefault_threshStatus_loanAmount = (
+        logistic_evaluation_view(
+            clf_logistic_model,
+            split_dataset,
+            currency,
+            threshold.probability_threshold_selected,
+            threshold.predicted_default_status,
+        )
+    )
+    return ModelView(
+        model=clf_logistic_model,
+        trueStatus_probabilityDefault_threshStatus_loanAmount_df=df_trueStatus_probabilityDefault_threshStatus_loanAmount,
+        probability_threshold_selected=threshold.probability_threshold_selected,
+        predicted_default_status=threshold.predicted_default_status,
+        prediction_probability_df=threshold.prediction_probability_df,
+    )

src/models/util_model_comparison.py → views/model_comparison.py RENAMED

@@ -1,21 +1,16 @@
 from typing import OrderedDict
 import streamlit as st
 from sklearn.metrics import roc_auc_score
-from  features.util_build_features import SplitDataset
-from  visualization.graphs_settings import (
-    streamlit_chart_setting_height_width
-)
-from  visualization.graphs_test import (
+from common.data import SplitDataset
+from common.views import (
     roc_auc_compare_n_models,
-    calibration_curve_report_commented_n
+    streamlit_chart_setting_height_width,
+    calibration_curve_report_commented_n,
 )
-from  models.util_model_class import ModelClass
-def roc_auc_for_model(split_dataset: SplitDataset, model_view: ModelClass):
+from views.typing import ModelView
+def roc_auc_for_model(split_dataset: SplitDataset, model_view: ModelView):
     roc_auc_model = roc_auc_score(
         split_dataset.y_test, model_view.predicted_default_status
     )
@@ -36,7 +31,7 @@ def roc_auc_for_model(split_dataset: SplitDataset, model_view: ModelClass):
 def model_comparison_view(
     split_dataset: SplitDataset,
-    model_views: OrderedDict[str, ModelClass],
+    model_views: OrderedDict[str, ModelView],
 ):
     st.header("Model Comparison")
@@ -48,7 +43,7 @@ def model_comparison_view(
             f"Receiver Operating Characteristic (ROC) Curve - {model_name}"
         )
         st.markdown(
-            f'Area Under the Receiver Operating Characteristic Curve from prediction scores from {model_name} model is {roc_auc_model}.\n'
+            f'Area Under the Receiver Operating Characteristic Curve from prediction scores from "{model_name}" model is {roc_auc_model}.\n'
         )
         st.markdown(
             f'The score of {"{:.2f}".format(roc_auc_model)} is in the {roc_auc_lvl} ROC AUC score category.'
@@ -83,4 +78,4 @@ def model_comparison_view(
     fig2.set_size_inches(xsize_cal, ysize_cal)
-    st.pyplot(fig2)
+    st.pyplot(fig2.figure)

src/models/util_strategy_table.py → views/strategy_table.py RENAMED

@@ -2,12 +2,12 @@ from typing import OrderedDict
 import plotly.express as px
 import numpy as np
 import streamlit as st
-from  models.util_test import create_strategyTable_df
-from  models.util_model_class import ModelClass
+from common.util import create_strategyTable_df
+from views.typing import ModelView
 def strategy_table_view(
-    currency: str, model_views: OrderedDict[str, ModelClass]
+    currency: str, model_views: OrderedDict[str, ModelView]
 ):
     st.header("Strategy Table")
@@ -89,7 +89,7 @@ def strategy_table_view(
         )
         st.metric(
-            label='Total expected loss:',
+            label=f"Total expected loss:",
             value=f"{currency} {tot_exp_loss:,.2f}",
             delta=None,
             delta_color="normal",

views/threshold.py ADDED

	@@ -0,0 +1,272 @@

+from dataclasses import dataclass
+from typing import Union, cast
+import numpy as np
+import streamlit as st
+import plotly.express as px
+import pandas as pd
+from xgboost.sklearn import XGBClassifier
+from sklearn.linear_model import LogisticRegression
+from common.data import SplitDataset
+from common.util import (
+    model_probability_values_df,
+    apply_threshold_to_probability_values,
+    find_best_threshold_J_statistic,
+    default_status_per_threshold,
+    classification_report_per_threshold,
+    thresh_classification_report_recall_accuracy,
+)
+from common.views import (
+    streamlit_2columns_metrics_df,
+    streamlit_2columns_metrics_pct_df,
+)
+@dataclass(frozen=True)
+class Threshold:
+    probability_threshold_selected: float
+    predicted_default_status: pd.Series
+    prediction_probability_df: pd.DataFrame
+def make_threshold_view(
+    model_name_short: str,
+    model_name: str,
+):
+    def view(
+        clf_gbt_model: Union[XGBClassifier, LogisticRegression],
+        split_dataset: SplitDataset,
+    ) -> Threshold:
+        st.subheader("Classification Probability Threshold - User Defined")
+        st.write(
+            f"""
+            The {model_name} model (obtained using training data) is applied to the testing data to predict each loan's probability of defaulting.\n
+            Each loan's probability of defaulting is compared to a probability threshold.\n
+            A loan is predicted to default if its predicted probability of defaulting is greater than the probability threshold.
+            """
+        )
+        threshold_gbt_default = st.slider(
+            label="Default Probability Threshold:",
+            min_value=0.0,
+            max_value=1.0,
+            value=0.8,
+            key=f"threshold_{model_name_short}_default",
+        )
+        clf_prediction_prob_df_gbt = model_probability_values_df(
+            clf_gbt_model,
+            split_dataset.X_test,
+        )
+        clf_thresh_predicted_default_status_user_gbt = (
+            apply_threshold_to_probability_values(
+                clf_prediction_prob_df_gbt,
+                threshold_gbt_default,
+            )
+        )
+        streamlit_2columns_metrics_df(
+            "# of Predicted Defaults",
+            "# of Predicted Non-Default",
+            clf_thresh_predicted_default_status_user_gbt,
+        )
+        streamlit_2columns_metrics_pct_df(
+            "% of Loans Predicted to Default",
+            "% of Loans Predicted not to Default",
+            clf_thresh_predicted_default_status_user_gbt,
+        )
+        st.subheader("J Statistic Driven Classification Probability Threshold")
+        J_statistic_best_threshold = find_best_threshold_J_statistic(
+            split_dataset.y_test, clf_prediction_prob_df_gbt
+        )
+        st.metric(
+            label="Youden's J statistic calculated best threshold",
+            value=J_statistic_best_threshold,
+        )
+        clf_thresh_predicted_default_status_Jstatistic_gbt = (
+            apply_threshold_to_probability_values(
+                clf_prediction_prob_df_gbt,
+                J_statistic_best_threshold,
+            )
+        )
+        streamlit_2columns_metrics_df(
+            "# of Predicted Defaults",
+            "# of Predicted Non-Default",
+            clf_thresh_predicted_default_status_Jstatistic_gbt,
+        )
+        streamlit_2columns_metrics_pct_df(
+            "% of Loans Predicted to Default",
+            "% of Loans Predicted not to Default",
+            clf_thresh_predicted_default_status_Jstatistic_gbt,
+        )
+        st.subheader(
+            "Recall and Accuracy Tradeoff with given Probability Threshold"
+        )
+        # Steps
+        # Get list of thresholds
+        # Get default status per threshold
+        # Get classification report per threshold
+        # Get recall, nondef recall, and accuracy per threshold
+        threshold_list = np.arange(0, 1, 0.025).round(decimals=3).tolist()
+        threshold_default_status_list = default_status_per_threshold(
+            threshold_list, clf_prediction_prob_df_gbt["PROB_DEFAULT"]
+        )
+        thresh_classification_report_dict = (
+            classification_report_per_threshold(
+                threshold_list,
+                threshold_default_status_list,
+                split_dataset.y_test,
+            )
+        )
+        (
+            thresh_def_recalls_list,
+            thresh_nondef_recalls_list,
+            thresh_accs_list,
+        ) = thresh_classification_report_recall_accuracy(
+            thresh_classification_report_dict
+        )
+        namelist = [
+            "Default Recall",
+            "Non Default Recall",
+            "Accuracy",
+            "Threshold",
+        ]
+        df = pd.DataFrame(
+            [
+                thresh_def_recalls_list,
+                thresh_nondef_recalls_list,
+                thresh_accs_list,
+                threshold_list,
+            ],
+            index=namelist,
+        )
+        df = df.T
+        fig2 = px.line(
+            data_frame=df,
+            y=["Default Recall", "Non Default Recall", "Accuracy"],
+            x="Threshold",
+        )
+        fig2.update_layout(
+            title="Recall and Accuracy score Trade-off with Probability Threshold",
+            xaxis_title="Probability Threshold",
+            yaxis_title="Score",
+        )
+        fig2.update_yaxes(range=[0.0, 1.0])
+        st.plotly_chart(fig2)
+        st.subheader("Acceptance Rate Driven Probability Threshold")
+        # Steps
+        # Set acceptance rate
+        # Get the probability threshold at that quantile of PROB_DEFAULT
+        # Plot the PROB_DEFAULT distribution with the threshold marked
+        # Apply the threshold to get the predicted default status
+        acceptance_rate = (
+            st.slider(
+                label="% of loans accepted (acceptance rate):",
+                min_value=0,
+                max_value=100,
+                value=85,
+                key=f"acceptance_rate_{model_name_short}",
+                format="%d%%",
+            )
+            / 100
+        )
+        acc_rate_thresh_gbt = np.quantile(
+            clf_prediction_prob_df_gbt["PROB_DEFAULT"], acceptance_rate
+        )
+        st.write(
+            f"An acceptance rate of {acceptance_rate:.0%} results in a probability threshold of {acc_rate_thresh_gbt:.3f}"
+        )
+        figa = px.histogram(clf_prediction_prob_df_gbt["PROB_DEFAULT"])
+        figa.update_layout(
+            title="Predicted Default Probabilities with Acceptance Rate Threshold",
+            xaxis_title="Probability of Default",
+            yaxis_title="Number of Loans",
+        )
+        figa.update_traces(marker_line_width=1, marker_line_color="white")
+        figa.add_vline(
+            x=acc_rate_thresh_gbt,
+            line_width=3,
+            line_dash="solid",
+            line_color="red",
+        )
+        st.plotly_chart(figa)
+        clf_thresh_predicted_default_status_acceptance_gbt = (
+            apply_threshold_to_probability_values(
+                clf_prediction_prob_df_gbt,
+                acc_rate_thresh_gbt,
+            )
+        )
+        st.write()
+        st.subheader("Selected Probability Threshold")
+        options = [
+            "User Defined",
+            "J Statistic Driven",
+            "Acceptance Rate Driven",
+        ]
+        prob_thresh_option = st.radio(
+            label="Selected Probability Threshold",
+            options=options,
+            key=f"{model_name_short}_radio_thresh",
+        )
+        if prob_thresh_option == "User Defined":
+            prob_thresh_selected_gbt = threshold_gbt_default
+            predicted_default_status_gbt = (
+                clf_thresh_predicted_default_status_user_gbt
+            )
+        elif prob_thresh_option == "J Statistic Driven":
+            prob_thresh_selected_gbt = J_statistic_best_threshold
+            predicted_default_status_gbt = (
+                clf_thresh_predicted_default_status_Jstatistic_gbt
+            )
+        else:
+            prob_thresh_selected_gbt = acc_rate_thresh_gbt
+            predicted_default_status_gbt = (
+                clf_thresh_predicted_default_status_acceptance_gbt
+            )
+        st.write(
+            f"Selected probability threshold is {prob_thresh_selected_gbt}"
+        )
+        return Threshold(
+            probability_threshold_selected=cast(
+                float, prob_thresh_selected_gbt
+            ),
+            predicted_default_status=predicted_default_status_gbt,
+            prediction_probability_df=clf_prediction_prob_df_gbt,
+        )
+    return view
+decision_tree_threshold_view = make_threshold_view("gbt", "decision tree")
+logistic_threshold_view = make_threshold_view("lg", "logistic")
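
Note on the recall and accuracy trade-off step in `views/threshold.py`: the helpers it calls (`default_status_per_threshold`, `classification_report_per_threshold`, `thresh_classification_report_recall_accuracy`) are not shown in this part of the diff. The sketch below is only an approximation of what that sweep computes, using a hypothetical function `recall_accuracy_per_threshold` and NumPy alone. It assumes a loan is predicted to default when its probability strictly exceeds the threshold, which may differ from the convention in the repository.

```python
import numpy as np


def recall_accuracy_per_threshold(prob_default, y_true, thresholds):
    """Hypothetical helper: score one set of predictions at several thresholds."""
    prob_default = np.asarray(prob_default, dtype=float)
    y_true = np.asarray(y_true, dtype=int)
    scores = {
        "Threshold": [],
        "Default Recall": [],
        "Non Default Recall": [],
        "Accuracy": [],
    }
    for threshold in thresholds:
        # Assumption: a loan is predicted to default when its probability
        # strictly exceeds the threshold.
        y_pred = (prob_default > threshold).astype(int)
        scores["Threshold"].append(float(threshold))
        # Recall for a class = share of that class's true members predicted correctly.
        scores["Default Recall"].append(float(np.mean(y_pred[y_true == 1] == 1)))
        scores["Non Default Recall"].append(float(np.mean(y_pred[y_true == 0] == 0)))
        scores["Accuracy"].append(float(np.mean(y_pred == y_true)))
    return scores


# Toy data: two non-defaults (0) followed by two defaults (1).
toy_scores = recall_accuracy_per_threshold(
    prob_default=[0.1, 0.4, 0.6, 0.9],
    y_true=[0, 0, 1, 1],
    thresholds=[0.0, 0.5, 0.95],
)
```

The returned dictionary uses the same column names as the plotted data frame, so it could be passed to `pd.DataFrame` before plotting. Both classes must be present in `y_true`, otherwise one of the recalls is undefined.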

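Note on the acceptance rate step in `views/threshold.py`: the threshold is taken as the `acceptance_rate` quantile of `PROB_DEFAULT`, so roughly that share of loans has a default probability at or below it. A minimal sketch on synthetic, evenly spaced probabilities; the function name `acceptance_rate_threshold` and the "accept when at or below the threshold" rule are assumptions made for illustration.

```python
import numpy as np


def acceptance_rate_threshold(prob_default, acceptance_rate):
    """Hypothetical helper: probability cut-off that accepts the given share of loans."""
    # np.quantile returns the value below which `acceptance_rate` of the data falls.
    return float(np.quantile(prob_default, acceptance_rate))


# Synthetic default probabilities, evenly spaced between 0 and 1.
prob_default = np.linspace(0.0, 1.0, 101)
threshold = acceptance_rate_threshold(prob_default, 0.85)

# Assumption: a loan is accepted when its default probability is at or below the threshold.
accepted_share = float(np.mean(prob_default <= threshold))
```

With a finite sample the accepted share only approximates the requested rate, because the quantile is interpolated between neighbouring probabilities.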
src/models/util_model_class.py → views/typing.py RENAMED

@@ -7,7 +7,7 @@ from sklearn.linear_model import LogisticRegression
 @dataclass(frozen=True)
-class ModelClass:
+class ModelView:
     model: Union[XGBClassifier, LogisticRegression]
     probability_threshold_selected: float
     predicted_default_status: pd.Series
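
Note on `@dataclass(frozen=True)`: instances of a frozen dataclass reject attribute assignment after construction, so a view result cannot be changed in place by later code. The sketch below uses a hypothetical stand-in, `ExampleView`, that omits the `model` field and uses a plain list instead of `pd.Series` so it runs with the standard library only.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import List


@dataclass(frozen=True)
class ExampleView:
    # Hypothetical stand-in for ModelView; the `model` field is omitted.
    probability_threshold_selected: float
    predicted_default_status: List[int]


view = ExampleView(
    probability_threshold_selected=0.4,
    predicted_default_status=[0, 1, 0],
)

try:
    # Assigning to a field of a frozen dataclass raises FrozenInstanceError.
    view.probability_threshold_selected = 0.5  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False

# dataclasses.replace builds a new instance with the changed field instead.
updated = replace(view, probability_threshold_selected=0.5)
```

Freezing only blocks rebinding of fields; a mutable value held in a field, such as the list here, can still be modified in place.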