import streamlit as st
import pandas as pd
import plotly.express as px
def main():
    st.title("📚 Project Documentation")
    # Q1: Development Timeline
    st.markdown("""
    ### ⏱️ Q1: How long did it take to solve the problem?

    The solution was developed in approximately 5 hours (excluding the data collection and model training phases).
    """)
    # Q2: Solution Explanation
    st.markdown("""
    ### 🔍 Q2: Can you explain your solution approach?

    The solution implements a multi-stage document classification pipeline:

    1. **Data Collection & Processing:**
       - Dataset: 2,500+ training URLs and 250+ test URLs
       - Implemented thread pooling with 20 workers for parallel processing (sketched below)
       - Reduced download time to ~40 minutes (vs. 3+ hours sequentially)
       - Used PDFPlumber for robust text extraction
    2. **Model Development Pipeline:**
       - Baseline approach:
         - TF-IDF vectorization for text representation
         - Logistic Regression for initial classification
         - Quick inference and resource-efficient
       - Advanced approach:
         - BERT-based architecture for deep learning
         - Fine-tuned on the construction document dataset
         - Superior context understanding and accuracy
    3. **Evaluation Strategy:**
       - Comprehensive metric suite (Precision, Recall, F1)
       - Special consideration for class imbalance
       - Comparative analysis between the baseline and BERT
    4. **Deployment & Demo:**
       - Streamlit-based interactive web interface
       - Real-time document classification
       - Comprehensive project documentation
       - Performance visualization and analytics

    💡 **Key insight:** Parallel processing significantly reduced data preparation time, allowing faster iteration and model experimentation. Combined with the dual-model approach, this provides both efficiency and accuracy in document classification.
    """)
    # Q3: Model Selection
    st.markdown("""
    ### 🤖 Q3: Which models did you use and why?

    A TF-IDF + Logistic Regression baseline was implemented first, followed by a BERT-based model:

    **Baseline model** (sketched below):
    - TF-IDF + Logistic Regression
    - Quick inference time
    - Resource-efficient

    **BERT model:**
    - Fine-tuned on 1,800 text samples
    - Better context understanding
    - Better handling of complex documents
    """)
    # Q4: Limitations and Improvements
    st.markdown("""
    ### ⚠️ Q4: What are the current limitations and potential improvements?

    **Current implementation & limitations:**
    - ~25% of dataset URLs were inaccessible
    - Thread pooling was used for parallel downloading of the train and test documents

    **Proposed improvements:**
    - Use recent LLMs such as GPT-4o or Claude 3.5 Sonnet with few-shot prompting to speed up development
    - Optimize the inference pipeline with distilled models such as DistilBERT, or with the latest
      BERT-style model, ModernBERT, and compare their performance (sketched below)
    - Add support for more document formats
    """)
    # Q5: Model Performance
    st.markdown("""
    ### 📊 Q5: What is the model's performance on test data?

    **BERT model performance:**

    | Category     | Precision | Recall | F1-Score | Support |
    |--------------|-----------|--------|----------|---------|
    | Cable        | 1.00      | 1.00   | 1.00     | 92      |
    | Fuses        | 0.95      | 1.00   | 0.98     | 42      |
    | Lighting     | 0.94      | 1.00   | 0.97     | 74      |
    | Others       | 1.00      | 0.92   | 0.96     | 83      |
    | Accuracy     |           |        | 0.98     | 291     |
    | Macro Avg    | 0.97      | 0.98   | 0.98     | 291     |
    | Weighted Avg | 0.98      | 0.98   | 0.98     | 291     |
    """)
st.markdown("""
✨ Perfect performance (1.00) for Cable category
📈 High recall (1.00) across most categories
🎯 Overall accuracy of 98%
⚖️ Balanced performance across all metrics
""", unsafe_allow_html=True)
    # Q6: Metric Selection
    st.markdown("""
    ### 📈 Q6: Why did you choose these particular metrics?

    Our metric selection was driven by the dataset characteristics:

    **Key considerations:**
    - The dataset has mild class imbalance (imbalance ratio: 2.36, computed as sketched below)
    - Need for balanced evaluation across all classes

    **Selected metrics:**
    - Precision: critical for minimizing false positives
    - Recall: important for catching all instances of each class
    - F1-Score: provides a balanced evaluation of both
    - Weighted average: accounts for class imbalance
    """)
    # Performance Visualization
    st.markdown("### 📊 Model Performance Comparison")
    metrics = {
        'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score'],
        'Baseline': [0.85, 0.83, 0.84, 0.83],
        'BERT': [0.98, 0.97, 0.98, 0.98]
    }
    df = pd.DataFrame(metrics)
    fig = px.bar(
        df,
        x='Metric',
        y=['Baseline', 'BERT'],
        barmode='group',
        title='Model Performance Comparison',
        color_discrete_sequence=['#2ecc71', '#3498db'],
        template='plotly_white'
    )
    fig.update_layout(
        title_x=0.5,
        title_font_size=20,
        legend_title_text='Model Type',
        xaxis_title="Evaluation Metric",
        yaxis_title="Score",
        bargap=0.2,
        height=500
    )
    st.plotly_chart(fig, use_container_width=True)
if __name__ == "__main__":
    main()