Please see Milestone4Documentation.md: https://github.com/aye-thuzar/CS634Projec

Here is the landing page for my app: https://sites.google.com/view/cs634-realestatehousepricepred/home

## Notebook:
Real estate pricing is a complex task that depends on many factors influencing property values. In recent years, machine learning models have become powerful tools for predicting house prices, helping stakeholders make informed decisions in the real estate market. This Python notebook presents a comprehensive approach to house price prediction using feature selection techniques together with XGBoost and Light Gradient Boosting Machine (LGBM) models, and it applies interpretability techniques so that the resulting models are accurate, reliable, and transparent.

### Libraries Used:

Before diving into the implementation, the notebook imports the essential Python libraries, each serving a specific purpose throughout the analysis and model development (a combined import sketch follows the list):
1. **shap:** A library for explaining machine learning models. SHAP values provide insights into how features contribute to model predictions, enhancing interpretability.
2. **sklearn:** A popular machine learning library with tools for classification, regression, clustering, and more. It provides utilities for data preprocessing, model evaluation, and train-test splitting.
3. **optuna:** An optimization framework for hyperparameter tuning. Optuna efficiently searches the hyperparameter space to find the best set of hyperparameters for the models.
4. **math, numpy, and pandas:** Core numerical and data-manipulation libraries used for mathematical operations and data handling.
5. **matplotlib and seaborn:** Libraries for data visualization, used to create plots and charts for a better understanding of the data.
6. **graphviz:** A library for visualizing decision trees. It helps in understanding the individual trees in the ensemble models.
7. **xgboost and lightgbm:** Libraries for gradient boosting algorithms. XGBoost and LGBM are known for their strong performance in regression tasks such as house price prediction.
8. **pickle:** A standard-library module for saving and loading Python objects. It is used to store the trained models for later use.

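For reference, a combined import block covering these dependencies might look like the following (the aliases and the exact set of submodules are assumptions, not necessarily the notebook's exact imports):

```python
import math
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz

import shap
import optuna
import xgboost as xgb
import lightgbm as lgb

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
```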
### Data Processing and Feature Selection:

The notebook proceeds with comprehensive data processing and feature selection steps to prepare the data for training the machine learning models (a condensed sketch of these steps follows the list):

1. **Importing Data:** The dataset, containing information about houses and their attributes, is imported using `pd.read_csv()`. The data is divided into a training set (`dataset`) and a test set (`testset`) for model evaluation.
2. **Exploratory Data Analysis:** An initial exploration of the training dataset is conducted using `info()`, providing insights into the dataset's structure and missing values.
3. **Setting the Target Variable:** The target variable, `SalePrice`, is separated from the training set and stored in a numpy array `y`. This is the variable the models are trained to predict.
4. **Feature Selection based on Correlation:** Columns with low correlation (< 0.4) with the target variable are dropped from both the training and test sets. Correlation values are calculated using `corr()`.
5. **Feature Selection based on Variance:** Columns with low variance (< 1) are dropped from both the training and test sets. Variance values are calculated using `var()`.
6. **Feature Selection based on High Correlation:** Columns with high correlation (> 0.5) with other features are dropped from both the training and test sets. These columns are identified using the correlation matrix produced by `corr()`.
7. **Handling Missing Data:** Missing values in numerical columns (`numerical`) are filled with the value 0 based on domain knowledge. Missing values in categorical columns (`categorical`) are filled with the string `'None'`.
8. **Label Encoding:** Categorical data is encoded using `LabelEncoder` from `sklearn.preprocessing`. This converts categorical values into a numerical format suitable for model training.
9. **Final Feature Selection using Decision Trees (Random Forest):** A Random Forest Regressor is used to identify the top 10 features that contribute most to predicting the target variable. The least important features are dropped from both the training and test sets.
10. **Normalizing Data:** Finally, the data is normalized using Min-Max scaling to bring all feature values within the range of 0 to 1, so that features with different scales do not dominate model training.

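A condensed sketch of this preprocessing pipeline, assuming the thresholds and column conventions described above (file names are placeholders, and the variance and inter-feature-correlation filters are omitted for brevity):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.ensemble import RandomForestRegressor

# Load the training and test data (file names are placeholders)
dataset = pd.read_csv("train.csv")
testset = pd.read_csv("test.csv")

# Separate the target variable
y = dataset.pop("SalePrice").to_numpy()

# Drop numeric columns whose correlation with the target is below 0.4
num_cols = dataset.select_dtypes(include=np.number).columns
corr = dataset[num_cols].corrwith(pd.Series(y, index=dataset.index)).abs()
low_corr = corr[corr < 0.4].index
dataset = dataset.drop(columns=low_corr)
testset = testset.drop(columns=low_corr, errors="ignore")

# Fill missing values: 0 for numeric columns, 'None' for categorical ones
numerical = dataset.select_dtypes(include=np.number).columns
categorical = dataset.select_dtypes(exclude=np.number).columns
dataset[numerical] = dataset[numerical].fillna(0)
dataset[categorical] = dataset[categorical].fillna("None")

# Label-encode the categorical columns
for col in categorical:
    dataset[col] = LabelEncoder().fit_transform(dataset[col].astype(str))

# Keep the 10 features a Random Forest considers most important
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(dataset, y)
top10 = dataset.columns[np.argsort(rf.feature_importances_)[::-1][:10]]
dataset = dataset[top10]

# Min-Max scale all remaining features to the [0, 1] range
X = pd.DataFrame(MinMaxScaler().fit_transform(dataset), columns=dataset.columns)
```

Keeping `X` as a DataFrame preserves the feature names, which the SHAP plots shown later can make use of.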
### Model Training and Evaluation:

With the data fully processed and the features selected, the training data (`X`) is used to train the XGBoost and LGBM models. The following steps are performed (a training and evaluation sketch follows the list):

1. **XGBoost Model Training:** The XGBoost model is initialized using the `xgb.XGBRegressor` class from the `xgboost` library. Hyperparameters such as the objective function, the maximum depth of the trees, and the number of boosting rounds are set. The model is then trained on the training data (`X`) and the target variable (`y`) using the `fit()` method.
2. **Feature Importances:** Feature importances are computed from the trained XGBoost model, and the top 10 features contributing most to the predictions are visualized in a bar plot. This provides valuable insight into which features play a significant role in determining house prices.
3. **Prediction on Test Data:** The trained XGBoost model is used to predict house prices for the test set (`testset`). The predictions are saved for later comparison and evaluation.
4. **Data Splitting for Testing:** Before the XGBoost model is trained, the training data (`X`) is split into training and testing sets using `train_test_split()` from `sklearn.model_selection`. The testing set is used to evaluate the model's performance.
5. **Model Evaluation:** The performance of the XGBoost model is evaluated on the testing data (`X_test`). The Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) are computed and printed to assess the model's accuracy in predicting house prices.

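A minimal sketch of this training and evaluation flow, assuming `X` and `y` from the preprocessing step (the hyperparameter values are illustrative, not the notebook's exact settings):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hold out part of the training data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline XGBoost regressor (objective, depth, and number of rounds are illustrative)
reg_xgb = xgb.XGBRegressor(objective="reg:squarederror", max_depth=6, n_estimators=500)
reg_xgb.fit(X_train, y_train)

# Evaluate with MAE, MSE, and RMSE
pred = reg_xgb.predict(X_test)
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
print(f"MAE: {mae:.2f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}")
```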
### SHAP (SHapley Additive exPlanations) Analysis for XGBoost:

To gain a deeper understanding of the XGBoost model's predictions, SHAP values are computed. SHAP values show how each feature contributes to the model's prediction for an individual data point. The following SHAP plots are generated to visualize feature contributions and interactions (a plotting sketch follows the list):

1. **Waterfall Plot:** A waterfall plot is created for the first observation in the training data. It shows how each feature contributes to the difference between the predicted price and the expected value, helping identify the most influential features for that particular prediction.
2. **Mean SHAP Value Plot:** This plot displays the mean absolute SHAP value of each feature across all observations, rather than the signed positive and negative contributions. It helps identify the most important features in the model's predictions.
3. **Summary Plot:** The summary plot visualizes all SHAP values for each feature, grouped by feature, with higher feature values shown in redder shades. This plot highlights important relationships between features and their impact on the predictions.
4. **Summary Plot with Interaction Values:** This summary plot shows the relationship between features and their SHAP interaction values, providing additional insight into significant feature interactions.
5. **Dependence Plot:** The dependence plot illustrates the relationship between two features, `GrLivArea` and `OverallQual`, and their SHAP interaction values. It shows how the predicted price changes as the features' values change, helping explain how individual feature values affect the model's predictions.

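A brief sketch of how these plots can be produced with the shap API for a tree-based model, assuming the fitted `reg_xgb` and the `X_train` DataFrame from the sketches above (and that `GrLivArea` and `OverallQual` survived feature selection):

```python
import shap

# Explain the trained XGBoost model on the training features
explainer = shap.TreeExplainer(reg_xgb)
shap_values = explainer(X_train)

shap.plots.waterfall(shap_values[0])   # waterfall plot for the first observation
shap.plots.bar(shap_values)            # mean |SHAP value| per feature
shap.plots.beeswarm(shap_values)       # summary plot of all SHAP values

# Interaction values for the interaction summary and dependence plots
interaction_values = explainer.shap_interaction_values(X_train)
shap.summary_plot(interaction_values, X_train)
shap.dependence_plot(("GrLivArea", "OverallQual"), interaction_values, X_train)
```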
### Hyperparameter Tuning with Optuna for XGBoost:

After assessing the initial performance of the XGBoost model, the notebook proceeds with hyperparameter tuning using Optuna. Hyperparameter tuning improves the model's performance by finding the best set of hyperparameters for XGBoost (an objective-function sketch follows the list):

1. **Creating the Optuna Study:** An Optuna study is created with `optuna.create_study()`, setting `direction='minimize'` because the objective is to minimize the Mean Squared Error (MSE). The study explores the hyperparameter space for a defined number of trials (`n_trials`).
2. **Hyperparameter Tuning for XGBoost using Optuna:** An objective function is defined that takes a set of sampled hyperparameters, trains a model, and returns the MSE as the evaluation metric to minimize. Optuna then searches the hyperparameter space for the combination that yields the lowest MSE.
3. **Optimized XGBoost Model:** After tuning, the best hyperparameters found by Optuna are used to create an optimized XGBoost model (`xgb_optimized`). This model is expected to outperform the initial XGBoost model thanks to the fine-tuned hyperparameters.
4. **XGBoost Model Evaluation:** The performance of the optimized XGBoost model is evaluated on the testing data (`X_test`). The MAE, MSE, and RMSE scores are calculated and printed to assess the model's improved accuracy.

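A condensed sketch of such an Optuna objective for XGBoost, reusing the train/test split from above (the searched parameter ranges are illustrative, not the notebook's exact search space):

```python
import optuna
import xgboost as xgb
from sklearn.metrics import mean_squared_error

def objective(trial):
    # Sample candidate hyperparameters (ranges are illustrative)
    params = {
        "objective": "reg:squarederror",
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = xgb.XGBRegressor(**params)
    model.fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))  # value Optuna minimizes

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)

# Retrain with the best hyperparameters found
xgb_optimized = xgb.XGBRegressor(objective="reg:squarederror", **study.best_params)
xgb_optimized.fit(X_train, y_train)
```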
#### SHAP (SHapley Additive exPlanations) Analysis for Optimized XGBoost Model:

With the optimized XGBoost model, SHAP analysis is performed once again to gain deeper insight into its predictions and feature importances. The same SHAP plots as before are generated, revealing how the optimized model's predictions differ from those of the initial model.
#### LGBM Baseline Model:

Moving on, the notebook introduces a baseline model using Light Gradient Boosting Machine (LGBM). LGBM is known for its fast training and strong performance, making it an excellent candidate for comparison with XGBoost. The baseline LGBM model (`reg_lgbm_baseline`) is trained on the training data (`X_train` and `y_train`), and its performance is evaluated using the MAE, MSE, and RMSE scores.

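A minimal sketch of the baseline LGBM model on the same split, assuming mostly default parameters:

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Baseline LGBM regressor with default parameters
reg_lgbm_baseline = lgb.LGBMRegressor()
reg_lgbm_baseline.fit(X_train, y_train)

pred = reg_lgbm_baseline.predict(X_test)
print("MAE :", mean_absolute_error(y_test, pred))
print("MSE :", mean_squared_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
```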
#### SHAP (SHapley Additive exPlanations) Analysis for LGBM Baseline Model:

With the baseline LGBM model trained, SHAP analysis is conducted to interpret its predictions and understand its feature importances. The SHAP plots show how LGBM's predictions differ from XGBoost's and which features have the most significant impact on its predictions.
#### Hyperparameter Tuning with Optuna for LGBM:

Similar to XGBoost, Optuna is used to tune the hyperparameters of the LGBM model, this time searching for the combination that minimizes the RMSE on the validation data. The tuned LGBM model is expected to improve on the baseline performance.

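The LGBM objective follows the same pattern as the XGBoost one, except that it returns the RMSE (the parameter ranges are again illustrative):

```python
import numpy as np
import optuna
import lightgbm as lgb
from sklearn.metrics import mean_squared_error

def lgbm_objective(trial):
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    }
    model = lgb.LGBMRegressor(**params)
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    return rmse  # Optuna minimizes the RMSE here

lgbm_study = optuna.create_study(direction="minimize")
lgbm_study.optimize(lgbm_objective, n_trials=50)
lgbm_optimized = lgb.LGBMRegressor(**lgbm_study.best_params).fit(X_train, y_train)
```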
#### SHAP (SHapley Additive exPlanations) Analysis for Optimized LGBM Model:

After hyperparameter tuning, SHAP analysis is performed on the optimized LGBM model to gain insight into its predictions. The SHAP plots reveal how the optimized LGBM model's predictions differ from the baseline's and how the feature importances change.
#### Model Comparison:

The performance of the optimized XGBoost and LGBM models is compared to identify the better-performing model for house price prediction, as sketched below. The RMSE, MAE, and MSE metrics are compared to evaluate the models' accuracy and reliability.

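One simple way to place the metrics side by side, assuming the fitted `xgb_optimized` and `lgbm_optimized` models from the sketches above:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

def scores(model):
    # Compute MAE, MSE, and RMSE for a fitted regressor on the held-out split
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    return {"MAE": mean_absolute_error(y_test, pred), "MSE": mse, "RMSE": np.sqrt(mse)}

comparison = pd.DataFrame({"XGBoost": scores(xgb_optimized), "LGBM": scores(lgbm_optimized)})
print(comparison)
```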
#### Model Pickling:

Finally, the trained models are saved with pickle so that they can be used with Streamlit for deployment. Pickling allows the models to be easily loaded and reused in real-world applications, making the predictions readily available to end users.

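A short sketch of saving and reloading a trained model with pickle (the file name is a placeholder):

```python
import pickle

# Save the optimized model to disk (file name is a placeholder)
with open("xgb_optimized.pkl", "wb") as f:
    pickle.dump(xgb_optimized, f)

# Later, e.g. inside the Streamlit app, load it back and predict
with open("xgb_optimized.pkl", "rb") as f:
    model = pickle.load(f)
prediction = model.predict(X_test.iloc[:1])
```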
### Conclusion:

In conclusion, this notebook presents an end-to-end approach to predicting house prices using feature selection techniques and the XGBoost and LGBM models, covering data preprocessing, feature selection, model training, hyperparameter tuning, model evaluation, and SHAP analysis. The SHAP plots provide valuable insight into the models' decision-making and highlight the features with the greatest impact on the predictions, keeping the models transparent as well as accurate. The optimized models achieved better performance than the baselines, demonstrating the effectiveness of hyperparameter tuning, and their predictions can help buyers, sellers, and real estate professionals make informed decisions in the real estate market. With the models trained, tuned, and interpreted, the pickled models are ready to be deployed in real-world applications.

**References:**

https://towardsdatascience.com/analysing-interactions-with-shap-8c4a2bc11c2a