import streamlit as st
import pandas as pd
import plotly.express as px


def main():
    st.title("Project Documentation")
    # Custom CSS for better styling
    st.markdown("""
    <style>
    .question-card {
        background-color: #f8f9fa;
        padding: 20px;
        border-radius: 10px;
        border-left: 5px solid #1f77b4;
        margin: 20px 0;
    }
    .question {
        color: #1f77b4;
        font-size: 1.2em;
        font-weight: bold;
        margin-bottom: 15px;
    }
    .answer {
        color: #2c3e50;
        line-height: 1.6;
    }
    </style>
    """, unsafe_allow_html=True)
    # Q1: Development Timeline
    st.markdown("""
    <div class="question-card">
      <div class="question">⏱️ Q1: How long did it take to solve the problem?</div>
      <div class="answer">
        The solution was developed in approximately <b>5 hours</b> (excluding data collection and model training phases).
      </div>
    </div>
    """, unsafe_allow_html=True)
    # Q2: Solution Explanation
    st.markdown("""
    <div class="question-card">
      <div class="question">Q2: Can you explain your solution approach?</div>
      <div class="answer">
        The solution implements a multi-stage document classification pipeline:
        <br><br>
        <b>1. Direct URL Text Approach:</b>
        <ul>
          <li>Initially considered direct URL text extraction</li>
          <li>Found limitations in accuracy and reliability</li>
        </ul>
        <br>
        <b>2. Baseline Approach (ML Model):</b>
        <ul>
          <li>Implemented TF-IDF vectorization</li>
          <li>Used Logistic Regression for classification</li>
          <li>Provided quick and efficient results</li>
        </ul>
        <br>
        <b>3. Deep Learning Approach (DL Model):</b>
        <ul>
          <li>Utilized BERT-based model architecture</li>
          <li>Fine-tuned on construction document dataset</li>
          <li>Achieved superior accuracy and context understanding</li>
        </ul>
      </div>
    </div>
    """, unsafe_allow_html=True)
    # Q3: Model Selection
    st.markdown("""
    <div class="question-card">
      <div class="question">Q3: Which models did you use and why?</div>
      <div class="answer">
        A baseline was implemented with TF-IDF and Logistic Regression, followed by a BERT-based model:
        <br><br>
        <b>Baseline Model:</b>
        <ul>
          <li>TF-IDF + Logistic Regression</li>
          <li>Quick inference time</li>
          <li>Resource-efficient</li>
        </ul>
        <br>
        <b>BERT Model:</b>
        <ul>
          <li>Fine-tuned on 1,800 text samples</li>
          <li>Better context understanding</li>
          <li>Better handling of complex documents</li>
        </ul>
      </div>
    </div>
    """, unsafe_allow_html=True)
    # Q4: Limitations and Improvements
    st.markdown("""
    <div class="question-card">
      <div class="question">⚠️ Q4: What are the current limitations and potential improvements?</div>
      <div class="answer">
        <b>Current Implementation & Limitations:</b>
        <ul>
          <li>~25% of dataset URLs were inaccessible</li>
          <li>Used a thread pool to download the train and test documents in parallel</li>
        </ul>
        <br>
        <b>Proposed Improvements:</b>
        <ul>
          <li>Use recent LLMs such as GPT-4o or Claude 3.5 Sonnet with few-shot prompting to speed up development</li>
          <li>Speed up inference with a distilled model such as DistilBERT, and compare performance against ModernBERT, the latest BERT-based model</li>
          <li>Add support for more document formats</li>
        </ul>
      </div>
    </div>
    """, unsafe_allow_html=True)
    # Q5: Model Performance
    st.markdown("""
    <div class="question-card">
      <div class="question">Q5: What is the model's performance on test data?</div>
      <div class="answer">
        <b>BERT Model Performance:</b>
        <br><br>
        <div style="overflow-x: auto;">
          <table style="
            width: 100%;
            border-collapse: collapse;
            margin: 20px 0;
            font-size: 0.9em;
            font-family: sans-serif;
            box-shadow: 0 0 20px rgba(0, 0, 0, 0.15);
            border-radius: 5px;
          ">
            <thead>
              <tr style="
                background-color: #1f77b4;
                color: white;
                text-align: left;
              ">
                <th style="padding: 12px 15px;">Category</th>
                <th style="padding: 12px 15px;">Precision</th>
                <th style="padding: 12px 15px;">Recall</th>
                <th style="padding: 12px 15px;">F1-Score</th>
                <th style="padding: 12px 15px;">Support</th>
              </tr>
            </thead>
            <tbody>
              <tr style="border-bottom: 1px solid #dddddd;">
                <td style="padding: 12px 15px;"><b>Cable</b></td>
                <td style="padding: 12px 15px;">1.00</td>
                <td style="padding: 12px 15px;">1.00</td>
                <td style="padding: 12px 15px;">1.00</td>
                <td style="padding: 12px 15px;">92</td>
              </tr>
              <tr style="border-bottom: 1px solid #dddddd; background-color: #f3f3f3;">
                <td style="padding: 12px 15px;"><b>Fuses</b></td>
                <td style="padding: 12px 15px;">0.95</td>
                <td style="padding: 12px 15px;">1.00</td>
                <td style="padding: 12px 15px;">0.98</td>
                <td style="padding: 12px 15px;">42</td>
              </tr>
              <tr style="border-bottom: 1px solid #dddddd;">
                <td style="padding: 12px 15px;"><b>Lighting</b></td>
                <td style="padding: 12px 15px;">0.94</td>
                <td style="padding: 12px 15px;">1.00</td>
                <td style="padding: 12px 15px;">0.97</td>
                <td style="padding: 12px 15px;">74</td>
              </tr>
              <tr style="border-bottom: 1px solid #dddddd; background-color: #f3f3f3;">
                <td style="padding: 12px 15px;"><b>Others</b></td>
                <td style="padding: 12px 15px;">1.00</td>
                <td style="padding: 12px 15px;">0.92</td>
                <td style="padding: 12px 15px;">0.96</td>
                <td style="padding: 12px 15px;">83</td>
              </tr>
            </tbody>
            <tfoot>
              <tr style="background-color: #f8f9fa; font-weight: bold; border-top: 2px solid #dddddd;">
                <td style="padding: 12px 15px;">Accuracy</td>
                <td style="padding: 12px 15px;" colspan="3">0.98</td>
                <td style="padding: 12px 15px;">291</td>
              </tr>
              <tr style="background-color: #f8f9fa; color: #666;">
                <td style="padding: 12px 15px;">Macro Avg</td>
                <td style="padding: 12px 15px;">0.97</td>
                <td style="padding: 12px 15px;">0.98</td>
                <td style="padding: 12px 15px;">0.98</td>
                <td style="padding: 12px 15px;">291</td>
              </tr>
              <tr style="background-color: #f8f9fa; color: #666;">
                <td style="padding: 12px 15px;">Weighted Avg</td>
                <td style="padding: 12px 15px;">0.98</td>
                <td style="padding: 12px 15px;">0.98</td>
                <td style="padding: 12px 15px;">0.98</td>
                <td style="padding: 12px 15px;">291</td>
              </tr>
            </tfoot>
          </table>
        </div>
      </div>
    </div>
    """, unsafe_allow_html=True)
    # Summary of the BERT results
    st.markdown("""
    <div style='
      background-color: #f8f9fa;
      padding: 20px;
      border-radius: 10px;
      border-left: 5px solid #1f77b4;
      margin: 20px 0;
    '>
      ✨ Perfect performance (1.00) for Cable category<br>
      High recall (1.00) across most categories<br>
      🎯 Overall accuracy of 98%<br>
      ⚖️ Balanced performance across all metrics
    </div>
    """, unsafe_allow_html=True)
    # Q6: Metric Selection
    st.markdown("""
    <div class="question-card">
      <div class="question">Q6: Why did you choose these particular metrics?</div>
      <div class="answer">
        Our metric selection was driven by the dataset characteristics:
        <br><br>
        <b>Key Considerations:</b>
        <ul>
          <li>Dataset has mild class imbalance (Imbalance Ratio: 2.36)</li>
          <li>Need for balanced evaluation across all classes</li>
        </ul>
        <br>
        <b>Selected Metrics:</b>
        <ul>
          <li><b>Precision:</b> Critical for minimizing false positives</li>
          <li><b>Recall:</b> Important for catching all instances of each class</li>
          <li><b>F1-Score:</b> Provides balanced evaluation of both metrics</li>
          <li><b>Weighted Average:</b> Accounts for class imbalance</li>
        </ul>
      </div>
    </div>
    """, unsafe_allow_html=True)
    # Performance Visualization
    st.markdown("### Model Performance Comparison")
    # One row per metric, one column per model (wide-form data for px.bar)
    metrics = {
        'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score'],
        'Baseline': [0.85, 0.83, 0.84, 0.83],
        'BERT': [0.98, 0.97, 0.98, 0.98]
    }
    df = pd.DataFrame(metrics)
    # Grouped bars: Baseline and BERT side by side for each metric
    fig = px.bar(
        df,
        x='Metric',
        y=['Baseline', 'BERT'],
        barmode='group',
        title='Model Performance Comparison',
        color_discrete_sequence=['#2ecc71', '#3498db'],
        template='plotly_white'
    )
    fig.update_layout(
        title_x=0.5,
        title_font_size=20,
        legend_title_text='Model Type',
        xaxis_title="Evaluation Metric",
        yaxis_title="Score",
        bargap=0.2,
        height=500
    )
    st.plotly_chart(fig, use_container_width=True)


if __name__ == "__main__":
    main()
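
The Macro Avg and Weighted Avg rows in the Q5 table can be reproduced from the four per-class rows. The snippet below is a minimal standalone sketch of that arithmetic and is not part of the app. The names `REPORT`, `macro_avg`, and `weighted_avg` are hypothetical helpers introduced here for illustration, and the inputs are the already-rounded values shown in the table, so the results should only be expected to agree with the table to about two decimal places.

```python
# Per-class rows copied from the Q5 table: (precision, recall, f1, support).
# These are rounded display values, not the raw model outputs.
REPORT = {
    "Cable": (1.00, 1.00, 1.00, 92),
    "Fuses": (0.95, 1.00, 0.98, 42),
    "Lighting": (0.94, 1.00, 0.97, 74),
    "Others": (1.00, 0.92, 0.96, 83),
}


def macro_avg(report, idx):
    """Unweighted mean of one metric column across all classes."""
    return sum(row[idx] for row in report.values()) / len(report)


def weighted_avg(report, idx):
    """Mean of one metric column, weighted by each class's support."""
    total = sum(row[3] for row in report.values())
    return sum(row[idx] * row[3] for row in report.values()) / total


if __name__ == "__main__":
    for name, idx in (("precision", 0), ("recall", 1), ("f1", 2)):
        print(f"{name}: macro={macro_avg(REPORT, idx):.4f} "
              f"weighted={weighted_avg(REPORT, idx):.4f}")
```

The macro average treats every class equally, while the weighted average lets the larger classes (Cable, Others) count for more, which is why the app reports both for a dataset with mild class imbalance.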