ayethuzar committed
Commit f1a8520 · unverified · 1 Parent(s): d07a3f6

Update README.md

Files changed (1):
  1. README.md +35 -0
README.md CHANGED
@@ -4,8 +4,12 @@ Milestone-3 notebook: https://colab.research.google.com/drive/17-7A0RkGcwqcJw0Ic
 
 Hugging Face App:
 
+***********
+
 Results:
 
+***********
+
 XGBoost Model's RMSE: 28986 (Milestone-2)
 
 Baseline LGBM's RMSE: 26233
@@ -44,6 +48,37 @@ min_child_samples : 1
 
 ***********
 
+Documentation
+
+***********
+
+Dataset: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview
+
+Data Processing and Feature Selection:
+
+For the feature selection, I started by dropping columns with a low correlation (< 0.4) with SalePrice, then dropped columns with low variance (< 1). Next, I checked the correlation matrix between columns and, with consideration for domain knowledge, dropped selected columns whose pairwise correlation was greater than 0.5. I then checked for NAs in the numerical columns and, based on the result, used domain knowledge to fill them with appropriate values; here, 0 was the most relevant value. The categorical NAs were replaced with 'None'. Once all the NAs were taken care of, I used LabelEncoder to encode the categorical values, then checked the correlations between columns again and dropped further columns based on domain knowledge. (A hedged code sketch of these steps follows the diff.)
+
+Here are the 10 features I selected:
+
+'OverallQual',
+'YearBuilt',
+'TotalBsmtSF',
+'GrLivArea',
+'MasVnrArea',
+'BsmtFinType1',
+'Neighborhood',
+'GarageType',
+'SaleCondition',
+'BsmtExposure'
+
+All the attributes are encoded and normalized before splitting into train (80%) and test (20%) sets.
+
+**Milestone 2:**
+
+For Milestone 2, I ran an XGBoost model with objective="reg:squarederror" and max_depth=3. The RMSE score is 28986. (A sketch follows the diff.)
+
+**Milestone 3:**
+
 Reference:
 
 https://github.com/adhok/streamlit_ames_housing_price_prediction_app/tree/main
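
The Data Processing and Feature Selection section above maps onto a short pandas/scikit-learn sketch. This is a minimal illustration, not the author's notebook: the filename train.csv, the use of MinMaxScaler for normalization, and random_state are assumptions, and the domain-knowledge-guided drops of column pairs with correlation > 0.5 were manual steps that are only noted in a comment.

```python
# Minimal sketch of the README's preprocessing steps. Thresholds (0.4, 1)
# come from the README; everything else is illustrative.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")  # Kaggle house-prices training data (assumed filename)

# 1) Drop numeric columns with low correlation (< 0.4) to SalePrice.
num = df.select_dtypes("number")
low_corr = num.columns[num.corrwith(num["SalePrice"]).abs() < 0.4]
df = df.drop(columns=low_corr)

# 2) Drop numeric columns with low variance (< 1).
num = df.select_dtypes("number").drop(columns=["SalePrice"])
df = df.drop(columns=num.columns[num.var() < 1])

# 3) (Manual in the README) inspect the correlation matrix and drop one of
#    each pair with correlation > 0.5, guided by domain knowledge.

# 4) Fill NAs: 0 for numeric columns, 'None' for categorical columns.
for col in df.columns:
    df[col] = df[col].fillna(0 if df[col].dtype.kind in "if" else "None")

# 5) Label-encode the categorical columns.
for col in df.select_dtypes("object"):
    df[col] = LabelEncoder().fit_transform(df[col])

# 6) Keep the 10 selected features, normalize, then split 80/20.
features = ['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea',
            'MasVnrArea', 'BsmtFinType1', 'Neighborhood', 'GarageType',
            'SaleCondition', 'BsmtExposure']
X = MinMaxScaler().fit_transform(df[features])
y = df["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # random_state is an assumption
```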
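The Milestone 2 model is stated above as XGBoost with objective="reg:squarederror" and max_depth=3; here is a sketch under those two settings. All other parameters are library defaults, so the reported RMSE of 28986 will not reproduce exactly.

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# X_train, X_test, y_train, y_test come from the preprocessing sketch above.
model = XGBRegressor(objective="reg:squarederror", max_depth=3)
model.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"XGBoost RMSE: {rmse:.0f}")
```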
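For the Milestone 3 baseline LGBM (RMSE 26233 in the Results above), only min_child_samples : 1 is visible in the diff's hunk context; the sketch below assumes library defaults for every other parameter and is not the author's tuned configuration.

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error

# X_train, X_test, y_train, y_test come from the preprocessing sketch above.
lgbm = LGBMRegressor(min_child_samples=1)  # only parameter visible in the diff
lgbm.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, lgbm.predict(X_test)))
print(f"Baseline LGBM RMSE: {rmse:.0f}")
```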