AmelieSchreiber committed
Commit d4048c3
Parent(s): 17b2397
Update README.md

README.md CHANGED
This model was finetuned on ~549K protein sequences from the UniProt database. It achieves
the following test metrics:

```python
# Train:
({'accuracy': 0.9905461579981686,
  'precision': 0.7695765003685506,
  'recall': 0.9841352974610041,
  'f1': 0.8637307441810476,
  'auc': 0.9874413786006525,
  'mcc': 0.8658850560635515},
# Test:
 {'accuracy': 0.9394282959813123,
  'precision': 0.3662722265170941,
  'recall': 0.8330231316088238,
  'f1': 0.5088208423175958,
  'mcc': 0.5283098562376193})
```

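For reference, here is a minimal sketch of how a metrics dictionary like the one above can be computed with scikit-learn. The names `y_true`, `y_pred`, and `y_prob` (flat arrays of true labels, hard predictions, and positive-class probabilities) are illustrative assumptions, not the exact evaluation code used for this model:

```python
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    matthews_corrcoef,
)

def compute_metrics(y_true, y_pred, y_prob):
    """Compute the metrics reported above from flat 0/1 arrays."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # AUC is computed from probabilities/scores, not hard predictions.
        "auc": roc_auc_score(y_true, y_prob),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }
```
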
To analyze the train and test metrics, we will consider each metric individually and then offer a comprehensive view of the
model's performance. Let's start:

### **1. Accuracy**
- **Train**: 99.05%
- **Test**: 93.94%

The accuracy is quite high in both the training and test datasets, indicating that the model is correctly identifying the positive
and negative classes most of the time.

### **2. Precision**
- **Train**: 76.96%
- **Test**: 36.63%

The precision, which measures the proportion of true positive predictions among all positive predictions, drops significantly in
the test set. This suggests that the model might be identifying too many false positives when generalized to unseen data.
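In confusion-matrix terms, precision = TP / (TP + FP), so a test precision of 36.63% means that only about 37% of the model's positive predictions are true positives.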

### **3. Recall**
- **Train**: 98.41%
- **Test**: 83.30%

The recall, which indicates the proportion of actual positives correctly identified, remains quite high in the test set, although
lower than in the training set. This suggests the model is quite sensitive and is able to identify most of the positive cases.
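Analogously, recall = TP / (TP + FN), so a test recall of 83.30% means the model recovers roughly five out of every six actual positives.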

### **4. F1-Score**
- **Train**: 86.37%
- **Test**: 50.88%

The F1-score is the harmonic mean of precision and recall. The significant drop in the F1-score from training to testing indicates
that the balance between precision and recall has worsened in the test set, which is primarily due to the lower precision.
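As a quick check with the test values above: F1 = 2 · precision · recall / (precision + recall) = 2 · 0.3663 · 0.8330 / (0.3663 + 0.8330) ≈ 0.509, which matches the reported test F1 of 50.88%.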

### **5. AUC (Area Under the ROC Curve)**
- **Train**: 98.74%
- **Test**: 88.83%

The AUC is high in both training and testing, but it decreases in the test set. A high AUC indicates that the model has a good measure
of separability and is able to distinguish between the positive and negative classes well.

### **6. MCC (Matthews Correlation Coefficient)**
- **Train**: 86.59%
- **Test**: 52.83%

MCC is a balanced metric that considers true and false positives and negatives. The decline in MCC from training to testing indicates
a decrease in the quality of the binary classifications.
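For reference, MCC uses all four confusion-matrix cells: MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)), which is why it remains informative even under heavy class imbalance.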

### **Overall Analysis**

- **Overfitting**: The significant drop in metrics such as precision, F1-score, and MCC from the training set to the test set suggests that the model might be overfitting to the training data, i.e., it may not generalize well to unseen data.

- **High Recall, Low Precision**: The model has a high recall but low precision on the test set, indicating that it is identifying too many cases as positive, including those that are actually negative (false positives). This could reflect a model that is biased towards predicting the positive class.

- **Improvement Suggestions**:
  - **Data Augmentation**: Consider data augmentation strategies to make the model more robust.
  - **Class Weights**: If there is a class imbalance in the dataset, adjusting the class weights during training might help (see the sketch after this list).
  - **Hyperparameter Tuning**: Experiment with different hyperparameters, including the learning rate, batch size, etc., to see if you can improve the model's performance on the test set.
  - **Feature Engineering**: Consider revisiting the features used to train the model. Sometimes, introducing new features or removing irrelevant ones can help improve performance.
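
As an illustration of the class-weights suggestion, here is a minimal sketch of a weighted cross-entropy loss in PyTorch. The two-class setup and the weight values are assumptions for illustration, not the configuration used to train this model:

```python
import torch
import torch.nn as nn

# Hypothetical per-class weights; in practice these are often derived from
# inverse class frequencies in the training set. Up-weighting a class makes
# errors on that class more costly, which shifts the precision/recall balance.
class_weights = torch.tensor([1.0, 5.0])

# Works with logits of shape (num_examples, 2) and integer labels of shape
# (num_examples,); ignore_index lets padded positions be skipped.
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

logits = torch.randn(8, 2)           # dummy logits for 8 examples
labels = torch.randint(0, 2, (8,))   # dummy 0/1 labels
loss = loss_fn(logits, labels)
print(loss.item())
```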

In conclusion, while the model performs excellently on the training set, its performance drops on the test set, suggesting that there
is room for improvement to make the model more generalizable to unseen data. It would be beneficial to look into strategies to reduce
overfitting and improve precision without significantly sacrificing recall.

The dataset size increase from ~209K protein sequences to ~549K clearly improved performance in terms of test metrics.
We used Hugging Face's parameter-efficient finetuning (PEFT) library to finetune with Low-Rank Adaptation (LoRA). We decided
to use a rank of 2 for the LoRA, as this was shown to slightly improve the test metrics compared to rank 8 and rank 16 on the