---
title: HousePricePredictionApp
emoji: 🏠
colorFrom: pink
colorTo: yellow
sdk: streamlit
sdk_version: 1.21.0
app_file: app.py
pinned: false
---

# CS634Project

Milestone-3 notebook: [GitHub](https://github.com/aye-thuzar/CS634Project/blob/milestone-3/CS634Project_Milestone3_AyeThuzar.ipynb) | Colab: https://colab.research.google.com/drive/17-7A0RkGcwqcJw0IcSvkniDmhbn5SuXe, https://colab.research.google.com/drive/1BeoZ4Dxhgd6OcUwPhk6rKCeFnDFMUCmt#scrollTo=TZ4Ci-YXOSl6

Hugging Face App: https://huggingface.co/spaces/ayethuzar/HousePricePredictionApp

App Demonstration Video: 

***********

Results

***********

XGBoost Model's RMSE: 28986  (Milestone-2)

Baseline LGBM's RMSE: 26233

Optuna optimized LGBM's RMSE: 13799.282803291926
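
For reference, RMSE is the square root of the mean squared difference between predicted and actual sale prices, so the scores above are in the same units as SalePrice. A minimal sketch with NumPy; the `rmse` helper and the sample prices are illustrative only and are not taken from the notebook:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices, for illustration only
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]
print(rmse(actual, predicted))
```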

***********

Hyperparameter Tuning with Optuna

************

Total number of trials:  120

Best RMSE score on validation data: 12338.665498601415

**Best params:**

boosting_type :	 goss

reg_alpha :	 3.9731274536451826

reg_lambda :	 0.8825276525195174

colsample_bytree :	 1.0

subsample :	 1.0

learning_rate :	 0.05

max_depth :	 6

num_leaves :	 48

min_child_samples :	 1

***********

## Documentation for Milestone 4

***********

Dataset: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

**Data Processing and Feature Selection:**

For feature selection, I started by dropping columns with a low correlation (< 0.4) with SalePrice. I then dropped columns with low variance (< 1). Next, I checked the correlation matrix between the remaining columns and, taking domain knowledge into account, dropped selected columns from pairs with a correlation greater than 0.5. I then checked for NAs in the numerical columns and, based on the result, used domain knowledge to fill them with an appropriate value; in this case, 0 was the most relevant value. NAs in the categorical columns were replaced with 'None'. Once all the NAs were taken care of, I used LabelEncoder to encode the categorical values. Finally, I checked the correlation between columns again and dropped further columns based on domain knowledge.
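
The automatable part of these steps can be sketched roughly as follows. This is a minimal sketch, not the notebook's code: the `select_and_clean` helper is hypothetical, the tiny DataFrame is a made-up stand-in for the Kaggle training data, and using the absolute value of the correlation is an assumption. The domain-knowledge drops between correlated columns are manual decisions and are not reproduced here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def select_and_clean(df, target="SalePrice", min_corr=0.4, min_var=1.0):
    """Hypothetical helper mirroring the filtering and cleaning steps."""
    numeric = df.select_dtypes(include="number")

    # 1. Drop numeric columns weakly correlated with the target
    #    (assumption: absolute correlation is compared to the threshold)
    corr = numeric.corr()[target].abs()
    weak = [c for c in numeric.columns if c != target and corr[c] < min_corr]
    df = df.drop(columns=weak)

    # 2. Drop numeric columns with low variance
    numeric = df.select_dtypes(include="number").drop(columns=[target])
    low_var = [c for c in numeric.columns if numeric[c].var() < min_var]
    df = df.drop(columns=low_var)

    # 3. Fill NAs: 0 for numeric columns, 'None' for categorical columns
    num_cols = list(df.select_dtypes(include="number").columns)
    cat_cols = list(df.select_dtypes(exclude="number").columns)
    for c in num_cols:
        df[c] = df[c].fillna(0)
    for c in cat_cols:
        df[c] = df[c].fillna("None")

    # 4. Label-encode the categorical columns
    for c in cat_cols:
        df[c] = LabelEncoder().fit_transform(df[c].astype(str))
    return df

# Made-up rows, for illustration only
raw = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9, 4],
    "GrLivArea": [1000, 1200, None, 2000, 2500, 900],
    "MoSold": [6, 1, 6, 1, 6, 1],
    "GarageType": ["Attchd", None, "Detchd", "Attchd", "BuiltIn", None],
    "SalePrice": [120000, 150000, 180000, 250000, 320000, 100000],
})
clean = select_and_clean(raw)
print(clean)
```

In this toy data, `MoSold` is only weakly correlated with `SalePrice` and is dropped in step 1; the missing `GrLivArea` value becomes 0 and the missing `GarageType` values become the category 'None' before encoding.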

Here are the 10 features I selected:

 'OverallQual': Overall material and finish quality
 
 'YearBuilt': Original construction date
 
 'TotalBsmtSF': Total square feet of basement area
 
 'GrLivArea': Above grade (ground) living area square feet
 
 'MasVnrArea': Masonry veneer area in square feet
 
 'BsmtFinType1': Quality of basement finished area
 
 'Neighborhood': Physical locations within Ames city limits
 
 'GarageType': Garage location
 
 'SaleCondition': Condition of sale
 
 'BsmtExposure': Walkout or garden-level basement walls

All attributes are encoded and normalized, and the data is then split into 80% train and 20% test.
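
A minimal sketch of the normalize-then-split step with scikit-learn. The README does not say which scaler was used, so `MinMaxScaler` is an assumption, as are `random_state=42` and the random stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                     # stand-in for the 10 encoded features
y = rng.normal(loc=180000, scale=50000, size=100)  # stand-in for SalePrice

# Assumption: min-max scaling; the text only says "normalized"
X_scaled = MinMaxScaler().fit_transform(X)

# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Note that scaling before splitting lets test-set statistics influence the scaler. A common alternative is to fit the scaler on the training portion only and then apply it to the test portion.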

**Milestone 2:**

For milestone 2, I used an XGBoost Model with objective="reg:squarederror" and max_depth=3. The RMSE score is 28986.

**Milestone 3:**

For milestone 3, I used a light gradient boosting machine (LGBM) with default parameters as the baseline, and tuned its hyperparameters with Optuna for the optimized model. The results are stated at the beginning of this README.

**References:**

https://towardsdatascience.com/analysing-interactions-with-shap-8c4a2bc11c2a

https://towardsdatascience.com/introduction-to-shap-with-python-d27edc23c454

https://www.aidancooper.co.uk/a-non-technical-guide-to-interpreting-shap-analyses/

https://www.kaggle.com/code/rnepal2/lightgbm-optuna-housing-prices-regression/notebook

https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/

https://towardsdatascience.com/why-is-everyone-at-kaggle-obsessed-with-optuna-for-hyperparameter-tuning-7608fdca337c

https://github.com/adhok/streamlit_ames_housing_price_prediction_app/tree/main