AmelieSchreiber committed
Commit d4048c3
Parent(s): 17b2397
Update README.md

README.md CHANGED
This model was finetuned on ~549K protein sequences from the UniProt database. It achieves
the following test metrics:

```python
# Train:
({'accuracy': 0.9905461579981686,
  'precision': 0.7695765003685506,
  'recall': 0.9841352974610041,
  'f1': 0.8637307441810476,
  'auc': 0.9874413786006525,
  'mcc': 0.8658850560635515},
# Test:
 {'accuracy': 0.9394282959813123,
  'precision': 0.3662722265170941,
  'recall': 0.8330231316088238,
  'f1': 0.5088208423175958,
  'mcc': 0.5283098562376193})
```

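For reference, here is a minimal sketch of how a metrics dictionary like the one above can be computed with scikit-learn. The names `y_true`, `y_pred`, and `y_prob` (flat arrays of true labels, hard predictions, and positive-class probabilities) are illustrative assumptions, not the exact evaluation code used for this model:

```python
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    matthews_corrcoef,
)

def compute_metrics(y_true, y_pred, y_prob):
    """Compute the metrics reported above from flat 0/1 arrays."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # AUC is computed from probabilities/scores, not hard predictions.
        "auc": roc_auc_score(y_true, y_prob),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }
```
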
To analyze the train and test metrics, we will consider each metric individually and then offer a comprehensive view of the
model's performance. Let's start:

### **1. Accuracy**
- **Train**: 99.05%
- **Test**: 93.94%

The accuracy is quite high in both the training and test datasets, indicating that the model is correctly identifying the positive
and negative classes most of the time.

### **2. Precision**
- **Train**: 76.96%
- **Test**: 36.63%

The precision, which measures the proportion of true positive predictions among all positive predictions, drops significantly in
the test set. This suggests that the model might be identifying too many false positives when generalized to unseen data.
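In confusion-matrix terms, precision = TP / (TP + FP), so a test precision of 36.63% means that only about 37% of the model's positive predictions are true positives.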

### **3. Recall**
- **Train**: 98.41%
- **Test**: 83.30%

The recall, which indicates the proportion of actual positives correctly identified, remains quite high in the test set, although
lower than in the training set. This suggests the model is quite sensitive and is able to identify most of the positive cases.
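Analogously, recall = TP / (TP + FN), so a test recall of 83.30% means the model recovers roughly five out of every six actual positives.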

### **4. F1-Score**
- **Train**: 86.37%
- **Test**: 50.88%

The F1-score is the harmonic mean of precision and recall. The significant drop in the F1-score from training to testing indicates
that the balance between precision and recall has worsened in the test set, which is primarily due to the lower precision.
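As a quick check with the test values above: F1 = 2 · precision · recall / (precision + recall) = 2 · 0.3663 · 0.8330 / (0.3663 + 0.8330) ≈ 0.509, which matches the reported test F1 of 50.88%.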

### **5. AUC (Area Under the ROC Curve)**
- **Train**: 98.74%
- **Test**: 88.83%

The AUC is high in both training and testing, but it decreases in the test set. A high AUC indicates that the model has a good measure
of separability and is able to distinguish between the positive and negative classes well.

### **6. MCC (Matthews Correlation Coefficient)**
- **Train**: 86.59%
- **Test**: 52.83%

MCC is a balanced metric that considers true and false positives and negatives. The decline in MCC from training to testing indicates
a decrease in the quality of the binary classifications.
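For reference, MCC uses all four confusion-matrix cells: MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)), which is why it remains informative even under heavy class imbalance.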

### **Overall Analysis**

- **Overfitting**: The significant drop in metrics such as precision, F1-score, and MCC from the training set to the test set suggests that the model might be overfitting to the training data, i.e., it may not generalize well to unseen data.

- **High Recall, Low Precision**: The model has a high recall but low precision on the test set, indicating that it is identifying too many cases as positive, including those that are actually negative (false positives). This could reflect a model that is biased towards predicting the positive class.

- **Improvement Suggestions**:
  - **Data Augmentation**: Consider data augmentation strategies to make the model more robust.
  - **Class Weights**: If there is a class imbalance in the dataset, adjusting the class weights during training might help (see the sketch after this list).
  - **Hyperparameter Tuning**: Experiment with different hyperparameters, including the learning rate, batch size, etc., to see if you can improve the model's performance on the test set.
  - **Feature Engineering**: Consider revisiting the features used to train the model. Sometimes, introducing new features or removing irrelevant ones can help improve performance.
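
As an illustration of the class-weights suggestion, here is a minimal sketch of a weighted cross-entropy loss in PyTorch. The two-class setup and the weight values are assumptions for illustration, not the configuration used to train this model:

```python
import torch
import torch.nn as nn

# Hypothetical per-class weights; in practice these are often derived from
# inverse class frequencies in the training set. Up-weighting a class makes
# errors on that class more costly, which shifts the precision/recall balance.
class_weights = torch.tensor([1.0, 5.0])

# Works with logits of shape (num_examples, 2) and integer labels of shape
# (num_examples,); ignore_index lets padded positions be skipped.
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

logits = torch.randn(8, 2)           # dummy logits for 8 examples
labels = torch.randint(0, 2, (8,))   # dummy 0/1 labels
loss = loss_fn(logits, labels)
print(loss.item())
```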

In conclusion, while the model performs excellently on the training set, its performance drops on the test set, suggesting that there
is room for improvement to make the model more generalizable to unseen data. It would be beneficial to look into strategies to reduce
overfitting and improve precision without significantly sacrificing recall.

The dataset size increase from ~209K protein sequences to ~549K clearly improved performance in terms of test metrics.
We used Hugging Face's parameter-efficient finetuning (PEFT) library to finetune with Low-Rank Adaptation (LoRA). We decided
to use a rank of 2 for the LoRA, as this was shown to slightly improve the test metrics compared to rank 8 and rank 16 on the