ayethuzar committed
Commit 17267d5 · unverified · 1 Parent(s): 797c7fd

Update README.md

Files changed (1): README.md (+20 -19)

7. **xgboost and lightgbm:** Libraries for gradient boosting algorithms. XGBoost and LGBM are known for their excellent performance in regression tasks like house price prediction.
8. **pickle:** A library for saving and loading Python objects. This is utilized for storing trained models for later use.
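
A minimal sketch of how a trained model can be saved and reloaded with pickle; the file name and the tiny stand-in model below are illustrative, not taken from the notebook:

```python
import pickle

import numpy as np
import xgboost as xgb

# Tiny stand-in regressor; the notebook's real model is trained on the full feature set.
X_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array([10.0, 20.0, 30.0])
model = xgb.XGBRegressor(n_estimators=10).fit(X_demo, y_demo)

# Persist the trained model to disk (hypothetical file name).
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Restore it later and predict without retraining.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict(np.array([[4.0]])))
```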
 
### Data Processing and Feature Selection:

The notebook proceeds with comprehensive data processing and feature selection steps to prepare the data for training the machine learning models. The following steps are performed (a consolidated code sketch of these steps appears after the list):

1. **Importing Data:** The dataset, containing information about houses and their attributes, is imported using the pd.read_csv() function. The data is then divided into the training set (dataset) and test set (testset) for model evaluation.

2. **Exploratory Data Analysis:** An initial exploration of the training dataset is conducted using the info() function, providing insights into the dataset's structure and missing values.

3. **Setting the Target Variable:** The target variable, SalePrice, is separated from the training set and stored in a separate numpy array y. This is the variable we want the models to predict.

4. **Feature Selection based on Correlation:** Columns with low correlation (< 0.4) with the target variable are dropped from both the training and test sets. Correlation values are calculated using the corr() function.

5. **Feature Selection based on Variance:** Columns with low variance (< 1) are dropped from both the training and test sets. Variance values are calculated using the var() function.

6. **Feature Selection based on High Correlation:** Columns with high correlation (> 0.5) with other features are dropped from both the training and test sets. These columns are identified using the correlation matrix and the corr() function.

7. **Handling Missing Data:** Missing values in numerical columns (numerical) are filled with the value 0 based on domain knowledge. Missing values in categorical columns (categorical) are filled with the string 'None'.

8. **Label Encoding:** Categorical data is encoded using the LabelEncoder from sklearn.preprocessing. This step converts categorical data into numerical format, making it suitable for model training.

9. **Final Feature Selection using Decision Trees (Random Forest):** The notebook employs a Random Forest Regressor to identify the top 10 features that contribute most to predicting the target variable. The least important features are dropped from both the training and test sets.

10. **Normalizing Data:** Finally, the data is normalized using Min-Max scaling to bring all feature values within the range of 0 to 1. This ensures that features with different scales do not dominate the model training process.
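
The steps above map onto a fairly standard pandas/scikit-learn pipeline. Below is a consolidated sketch of the preprocessing, assuming Kaggle-style train.csv/test.csv files and using illustrative variable names where the notebook's own are not shown; it is not a copy of the notebook's code:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# 1-2. Import the data (file names assumed) and inspect structure and missing values.
dataset = pd.read_csv("train.csv")
testset = pd.read_csv("test.csv")
dataset.info()

# 3. Separate the target variable into a numpy array y.
target = dataset.pop("SalePrice")
y = target.to_numpy()

# 4. Drop numeric features whose correlation with the target is below 0.4.
numeric_cols = dataset.select_dtypes(include=np.number).columns
corr_target = dataset[numeric_cols].corrwith(target).abs()
low_corr = [c for c in corr_target.index if corr_target[c] < 0.4]
dataset = dataset.drop(columns=low_corr)
testset = testset.drop(columns=[c for c in low_corr if c in testset.columns])

# 5. Drop numeric features with variance below 1.
variances = dataset.var(numeric_only=True)
low_var = [c for c in variances.index if variances[c] < 1]
dataset = dataset.drop(columns=low_var)
testset = testset.drop(columns=[c for c in low_var if c in testset.columns])

# 6. Drop one feature from each pair of features correlated above 0.5 with each other.
corr_matrix = dataset.corr(numeric_only=True).abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
high_corr = [c for c in upper.columns if (upper[c] > 0.5).any()]
dataset = dataset.drop(columns=high_corr)
testset = testset.drop(columns=[c for c in high_corr if c in testset.columns])

# 7. Fill missing values: 0 for numerical columns, 'None' for categorical ones.
numerical = dataset.select_dtypes(include=np.number).columns
categorical = dataset.select_dtypes(exclude=np.number).columns
dataset[numerical] = dataset[numerical].fillna(0)
dataset[categorical] = dataset[categorical].fillna("None")

# 8. Label-encode the categorical columns.
for col in categorical:
    dataset[col] = LabelEncoder().fit_transform(dataset[col].astype(str))

# 9. Keep only the 10 most important features according to a Random Forest.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(dataset, y)
top10 = pd.Series(rf.feature_importances_, index=dataset.columns).nlargest(10).index
dataset = dataset[top10]

# 10. Min-max scale the remaining features into [0, 1].
X = pd.DataFrame(MinMaxScaler().fit_transform(dataset), columns=top10)
# The notebook applies the same fills, encoding, selection and scaling to testset.
```
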
### Model Training and Evaluation:

With the data fully processed and features selected, the training data (X) is prepared and used to train the XGBoost and LGBM models. The following steps are performed (a sketch of the training and evaluation code appears after the list):

1. **XGBoost Model Training:** The XGBoost model is initialized using the xgb.XGBRegressor class from the xgboost library. Hyperparameters such as the objective function, maximum depth of the trees, and the number of boosting rounds are set. The model is then trained on the training data (X) and the target variable (y) using the fit() method.

2. **Feature Importances:** The feature importances are computed using the trained XGBoost model. The top 10 features that contribute most to the prediction are visualized using a bar plot. This provides valuable insights into which features play a significant role in determining house prices.

3. **Prediction on Test Data:** The trained XGBoost model is used to predict house prices for the test set (testset). The predictions are saved for later comparison and evaluation.

4. **Data Splitting for Testing:** Before training the XGBoost model, the training data (X) is split into training and testing sets using the train_test_split() function from sklearn.model_selection. The testing set will be used to evaluate the model's performance.

5. **Model Evaluation:** The performance of the XGBoost model is evaluated using the testing data (X_test). The Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) are computed and printed to evaluate the model's accuracy in predicting house prices.
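
A sketch of this training and evaluation flow, continuing from the X and y produced by the preprocessing sketch above; the specific hyperparameter values are illustrative rather than the notebook's:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Hold out part of the training data to evaluate the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the XGBoost regressor (hyperparameter values shown are placeholders).
model = xgb.XGBRegressor(objective="reg:squarederror", max_depth=6, n_estimators=500)
model.fit(X_train, y_train)

# Top-10 feature importances, shown as a bar plot.
importances = pd.Series(model.feature_importances_, index=X.columns).nlargest(10)
importances.plot(kind="barh")
plt.show()

# Evaluate on the held-out split with MAE, MSE and RMSE.
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
print(f"MAE: {mae:.2f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}")

# Predictions for the separate test set, once it has gone through the same preprocessing:
# test_predictions = model.predict(testset_processed)  # hypothetical variable name
```
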
### SHAP (SHapley Additive exPlanations) Analysis for XGBoost:

To gain a deeper understanding of the XGBoost model's predictions, SHAP values are computed. SHAP values provide insights into how each feature contributes to the model's predictions for individual data points. The following SHAP plots are generated to visualize the feature contributions and interactions:

Waterfall Plot: A waterfall plot is created for the first observation in the training data. It shows how each feature contributes to the difference between the predicted price and the expected value. This plot helps identify the most influential features for a particular prediction.
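
A sketch of how such a waterfall plot is typically produced with the shap library for a tree-based model, continuing from the fitted model and X_train above; it assumes a recent version of the shap package and may differ from the notebook's exact calls:

```python
import shap

# TreeExplainer computes SHAP values efficiently for tree ensembles such as XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_train)

# Waterfall plot for the first observation: each bar shows how a feature pushes the
# prediction away from the expected value (the average model output).
shap.plots.waterfall(shap_values[0])
```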