⏱️ Q1: How long did it take to solve the problem?

The solution was developed in approximately 5 hours (excluding data collection and model training phases).

🔍 Q2: Can you explain your solution approach?

The solution implements a multi-stage document classification pipeline:

1. Data Collection & Processing:

Dataset: 2500+ training URLs and 250+ test URLs
Implemented ThreadPooling with 20 workers for parallel processing
Reduced download time to ~40 minutes (vs. 3+ hours sequential)
Used PDFPlumber for robust text extraction

2. Model Development Pipeline:

Baseline Approach:
- TF-IDF vectorization for text representation
- Logistic Regression for initial classification
- Quick inference and resource-efficient

Advanced Approach:
- BERT-based architecture for deep learning
- Fine-tuned on construction document dataset
- Superior context understanding and accuracy

3. Evaluation Strategy:

Comprehensive metric suite (Precision, Recall, F1)
Special consideration for class imbalance
Comparative analysis between baseline and BERT

4. Deployment & Demo:

Streamlit-based interactive web interface
Real-time document classification
Comprehensive project documentation
Performance visualization and analytics

💡 Key implementation: The parallel processing implementation significantly reduced data preparation time, allowing for faster iteration and model experimentation. This, combined with the dual-model approach, provides both efficiency and accuracy in document classification.

🤖 Q3: Which models did you use and why?

Implemented baseline using TF-IDF and Logistic Regression and then used BERT-based model:

Baseline Model:

TF-IDF + Logistic Regression
Quick inference time
Resource-efficient

BERT Model:

Fine-tuned on 1800 samples text
Better context understanding
Better handling of complex documents

⚠️ Q4: What are the current limitations and potential improvements?

Current Implementation & Limitations:

~25% of dataset URLs were inaccessible
Used Threadpooling for parallel downloading of train and test documents

Proposed Improvements:

Use latest LLMs like GPT-4o, Claude 3.5 Sonnet etc with few shot prompting to speed up the development process
Optimize inference pipeline for faster processing using distilled models like DistilBERT, or the last BERT based model - ModernBERT to compare the performance
Add support for more document formats

📊 Q5: What is the model's performance on test data?

BERT Model Performance:

Category	Precision	Recall	F1-Score	Support
Cable	1.00	1.00	1.00	92
Fuses	0.95	1.00	0.98	42
Lighting	0.94	1.00	0.97	74
Others	1.00	0.92	0.96	83
Accuracy	0.98			291
Macro Avg	0.97	0.98	0.98	291
Weighted Avg	0.98	0.98	0.98	291

📈 Q6: Why did you choose these particular metrics?

Our metric selection was driven by the dataset characteristics:

Key Considerations:

Dataset has mild class imbalance (Imbalance Ratio: 2.36)
Need for balanced evaluation across all classes

Selected Metrics:

Precision: Critical for minimizing false positives
Recall: Important for catching all instances of each class
F1-Score: Provides balanced evaluation of both metrics
Weighted Average: Accounts for class imbalance