import streamlit as st
import pandas as pd
import plotly.express as px
def main():
    st.title("📚 Project Documentation")
    # Q1: Development Timeline
    st.markdown("""
#### ⏱️ Q1: How long did it take to solve the problem?

The solution was developed in approximately 5 hours (excluding the data collection and model training phases).
""", unsafe_allow_html=True)
    # Q2: Solution Explanation
    st.markdown("""
#### 🔍 Q2: Can you explain your solution approach?

The solution implements a multi-stage document classification pipeline:

1. **Direct URL Text Approach:**
   - Initially considered extracting text directly from the URLs
   - Found limitations in accuracy and reliability
2. **Baseline Approach (ML Model):**
   - Implemented TF-IDF vectorization
   - Used Logistic Regression for classification
   - Provided quick and efficient results
3. **Deep Learning Approach (DL Model):**
   - Utilized a BERT-based model architecture
   - Fine-tuned on the construction document dataset
   - Achieved superior accuracy and context understanding

An illustrative sketch of the baseline stage is shown below.
""", unsafe_allow_html=True)
    # Q3: Model Selection
    st.markdown("""
#### 🤖 Q3: Which models did you use and why?

A baseline was implemented with TF-IDF and Logistic Regression, followed by a fine-tuned BERT-based model:

**Baseline Model:**
- TF-IDF + Logistic Regression
- Quick inference time
- Resource-efficient

**BERT Model:**
- Fine-tuned on ~1,800 text samples
- Better context understanding
- Better handling of complex documents

An illustrative fine-tuning sketch is shown below.
""", unsafe_allow_html=True)
    # Q4: Limitations and Improvements
    st.markdown("""
#### ⚠️ Q4: What are the current limitations and potential improvements?

**Current Implementation & Limitations:**
- ~25% of the dataset URLs were inaccessible
- Used a thread pool for parallel downloading of the train and test documents (sketched below)

**Proposed Improvements:**
- Use recent LLMs such as GPT-4o or Claude 3.5 Sonnet with few-shot prompting to speed up development
- Optimize the inference pipeline for faster processing using distilled models such as DistilBERT, or the more recent ModernBERT, and compare their performance
- Add support for more document formats
""", unsafe_allow_html=True)
    # Q5: Model Performance
    st.markdown("""
#### 📊 Q5: What is the model's performance on test data?

**BERT Model Performance:**

| Category     | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| Cable        | 1.00      | 1.00   | 1.00     | 92      |
| Fuses        | 0.95      | 1.00   | 0.98     | 42      |
| Lighting     | 0.94      | 1.00   | 0.97     | 74      |
| Others       | 1.00      | 0.92   | 0.96     | 83      |
| Accuracy     |           |        | 0.98     | 291     |
| Macro Avg    | 0.97      | 0.98   | 0.98     | 291     |
| Weighted Avg | 0.98      | 0.98   | 0.98     | 291     |
""", unsafe_allow_html=True)
    st.markdown("""
- ✨ Perfect scores (1.00) for the Cable category
- 📈 High recall (1.00) across most categories
- 🎯 Overall accuracy of 98%
- ⚖️ Balanced performance across all metrics
""", unsafe_allow_html=True)
    # Q6: Metric Selection
    st.markdown("""
#### 📈 Q6: Why did you choose these particular metrics?

Our metric selection was driven by the dataset characteristics:

**Key Considerations:**
- The dataset has mild class imbalance (imbalance ratio: 2.36, computed as sketched below)
- Need for balanced evaluation across all classes

**Selected Metrics:**
- Precision: critical for minimizing false positives
- Recall: important for catching all instances of each class
- F1-Score: provides a balanced evaluation of both metrics
- Weighted Average: accounts for class imbalance
""", unsafe_allow_html=True)
    # Performance Visualization
    st.markdown("### 📊 Model Performance Comparison")
    metrics = {
        'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score'],
        'Baseline': [0.85, 0.83, 0.84, 0.83],
        'BERT': [0.98, 0.97, 0.98, 0.98]
    }
    df = pd.DataFrame(metrics)
    fig = px.bar(
        df,
        x='Metric',
        y=['Baseline', 'BERT'],
        barmode='group',
        title='Model Performance Comparison',
        color_discrete_sequence=['#2ecc71', '#3498db'],
        template='plotly_white'
    )
    fig.update_layout(
        title_x=0.5,
        title_font_size=20,
        legend_title_text='Model Type',
        xaxis_title="Evaluation Metric",
        yaxis_title="Score",
        bargap=0.2,
        height=500
    )
    st.plotly_chart(fig, use_container_width=True)
if __name__ == "__main__":
    main()