import streamlit as st
import pandas as pd
import plotly.express as px
def main():
    st.title("📚 Project Documentation")
    # Q1: Development Timeline
    st.markdown("""
#### ⏱️ Q1: How long did it take to solve the problem?

The solution was developed in approximately 5 hours (excluding the data collection and model training phases).
""", unsafe_allow_html=True)
    # Q2: Solution Explanation
    st.markdown("""
#### 🔍 Q2: Can you explain your solution approach?

The solution implements a multi-stage document classification pipeline:

1. **Direct URL Text Approach:**
   - Initially considered extracting text directly from the URLs
   - Found limitations in accuracy and reliability
2. **Baseline Approach (ML Model):**
   - Implemented TF-IDF vectorization
   - Used Logistic Regression for classification
   - Provided quick and efficient results
3. **Deep Learning Approach (DL Model):**
   - Utilized a BERT-based model architecture
   - Fine-tuned on the construction document dataset
   - Achieved superior accuracy and context understanding

An illustrative sketch of the baseline stage is shown below.
""", unsafe_allow_html=True)
    # Q3: Model Selection
    st.markdown("""
#### 🤖 Q3: Which models did you use and why?

A baseline was implemented with TF-IDF and Logistic Regression, followed by a fine-tuned BERT-based model:

**Baseline Model:**
- TF-IDF + Logistic Regression
- Quick inference time
- Resource-efficient

**BERT Model:**
- Fine-tuned on ~1,800 text samples
- Better context understanding
- Better handling of complex documents

An illustrative fine-tuning sketch is shown below.
""", unsafe_allow_html=True)
    # Q4: Limitations and Improvements
    st.markdown("""
#### ⚠️ Q4: What are the current limitations and potential improvements?

**Current Implementation & Limitations:**
- ~25% of the dataset URLs were inaccessible
- Used a thread pool for parallel downloading of the train and test documents (sketched below)

**Proposed Improvements:**
- Use recent LLMs such as GPT-4o or Claude 3.5 Sonnet with few-shot prompting to speed up development
- Optimize the inference pipeline for faster processing using distilled models such as DistilBERT, or the more recent ModernBERT, and compare their performance
- Add support for more document formats
""", unsafe_allow_html=True)
    # Q5: Model Performance
    st.markdown("""
#### 📊 Q5: What is the model's performance on test data?

**BERT Model Performance:**

| Category     | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| Cable        | 1.00      | 1.00   | 1.00     | 92      |
| Fuses        | 0.95      | 1.00   | 0.98     | 42      |
| Lighting     | 0.94      | 1.00   | 0.97     | 74      |
| Others       | 1.00      | 0.92   | 0.96     | 83      |
| Accuracy     |           |        | 0.98     | 291     |
| Macro Avg    | 0.97      | 0.98   | 0.98     | 291     |
| Weighted Avg | 0.98      | 0.98   | 0.98     | 291     |
""", unsafe_allow_html=True)
    st.markdown("""
- ✨ Perfect scores (1.00) for the Cable category
- 📈 High recall (1.00) across most categories
- 🎯 Overall accuracy of 98%
- ⚖️ Balanced performance across all metrics
""", unsafe_allow_html=True)
    # Q6: Metric Selection
    st.markdown("""
#### 📈 Q6: Why did you choose these particular metrics?

Our metric selection was driven by the dataset characteristics:

**Key Considerations:**
- The dataset has mild class imbalance (imbalance ratio: 2.36, computed as sketched below)
- Need for balanced evaluation across all classes

**Selected Metrics:**
- Precision: critical for minimizing false positives
- Recall: important for catching all instances of each class
- F1-Score: provides a balanced evaluation of both metrics
- Weighted Average: accounts for class imbalance
""", unsafe_allow_html=True)
    # Performance Visualization
    st.markdown("### 📊 Model Performance Comparison")
    metrics = {
        'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score'],
        'Baseline': [0.85, 0.83, 0.84, 0.83],
        'BERT': [0.98, 0.97, 0.98, 0.98]
    }
    df = pd.DataFrame(metrics)
    fig = px.bar(
        df,
        x='Metric',
        y=['Baseline', 'BERT'],
        barmode='group',
        title='Model Performance Comparison',
        color_discrete_sequence=['#2ecc71', '#3498db'],
        template='plotly_white'
    )
    fig.update_layout(
        title_x=0.5,
        title_font_size=20,
        legend_title_text='Model Type',
        xaxis_title="Evaluation Metric",
        yaxis_title="Score",
        bargap=0.2,
        height=500
    )
    st.plotly_chart(fig, use_container_width=True)
if __name__ == "__main__":
    main()