Report for JiaqiLee/imdb-finetuned-bert-base-uncased
#95, opened by giskard-bot
Hi Team,
This is a report from Giskard Bot Scan 🐢.
We have identified 10 potential vulnerabilities in your model based on an automated scan.
This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).
👉Robustness issues (2)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Robustness | major 🔴 | — | Fail rate = 0.125 | Add typos | 100/800 tested samples (12.5%) changed prediction after perturbation |
🔍✨Examples
When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 12.5% of the cases. We expected the predictions not to be affected by this transformation.
Index | text | Add typos(text) | Original prediction | Prediction after perturbation |
---|---|---|---|---|
11 | it takes a strange kind of laziness to waste the talents of robert forster , anne meara , eugene levy , and reginald veljohnson all in the same movie . | it takes a strange kind of laziness to wazte the talwnts of robert forster , anne meara , eugene levy , and rebinald veljohnson all in the same movie .. | negative (p = 1.00) | positive (p = 1.00) |
21 | the iditarod lasts for days - this just felt like it did . | the irditarod lasts for days - this just felt ike it did . | negative (p = 0.96) | positive (p = 0.97) |
22 | holden caulfield did it better . | holdsn caulfkeld did t better . | positive (p = 0.97) | negative (p = 0.99) |
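The fail rate reported above (100/800 = 0.125) is the share of tested samples whose predicted label flips after the perturbation. A minimal sketch of that calculation follows. The names `fail_rate`, `toy_predict`, and `toy_typo` are hypothetical stand-ins, and skipping samples that the transformation leaves unchanged is an assumption made here, not a confirmed detail of the scanner.

```python
from typing import Callable, Sequence


def fail_rate(
    predict: Callable[[str], str],
    texts: Sequence[str],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of tested samples whose predicted label changes after perturbation."""
    tested = 0
    changed = 0
    for text in texts:
        perturbed = perturb(text)
        if perturbed == text:
            # Assumption: samples the transformation does not alter are not counted.
            continue
        tested += 1
        if predict(perturbed) != predict(text):
            changed += 1
    return changed / tested if tested else 0.0


# Hypothetical toy classifier and typo transformation, for illustration only.
def toy_predict(text: str) -> str:
    return "negative" if "waste" in text else "positive"


def toy_typo(text: str) -> str:
    return text.replace("waste", "wazte")


samples = ["a waste of talent", "a joy to watch"]
rate = fail_rate(toy_predict, samples, toy_typo)
print(rate)
```

In this toy run only the first sample is altered by the typo, and its label flips, so every tested sample fails. A real check would pass the model's own prediction function and a typo generator instead.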
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Robustness | medium 🟡 | — | Fail rate = 0.059 | Punctuation Removal | 51/866 tested samples (5.89%) changed prediction after perturbation |
🔍✨Examples
When feature “text” is perturbed with the transformation “Punctuation Removal”, the model changes its prediction in 5.89% of the cases. We expected the predictions not to be affected by this transformation.
Index | text | Punctuation Removal(text) | Original prediction | Prediction after perturbation |
---|---|---|---|---|
4 | it 's slow -- very , very slow . | it s slow very very slow | positive (p = 0.52) | negative (p = 0.77) |
33 | if the movie succeeds in instilling a wary sense of ` there but for the grace of god , ' it is far too self-conscious to draw you deeply into its world . | if the movie succeeds in instilling a wary sense of there but for the grace of god it is far too self conscious to draw you deeply into its world | negative (p = 1.00) | positive (p = 0.99) |
66 | if you 're hard up for raunchy college humor , this is your ticket right here . | if you re hard up for raunchy college humor this is your ticket right here | positive (p = 0.89) | negative (p = 0.57) |
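The perturbed texts above look as if each punctuation mark was replaced by a space and the remaining whitespace collapsed (for example, "self-conscious" becomes "self conscious"). The sketch below reproduces that behaviour on the report's own examples. It is an approximation inferred from the output shown here, not the scanner's actual implementation, and `remove_punctuation` is a hypothetical name.

```python
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def remove_punctuation(text: str) -> str:
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())


example = remove_punctuation("it 's slow -- very , very slow .")
print(example)
```

Feeding both the original and the stripped text to the model and comparing the predicted labels gives the kind of check summarised in the table.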
👉Performance issues (8)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | major 🔴 | text_length(text) < 89.500 AND text_length(text) >= 80.500 | Precision = 0.719 | — | 15.79% lower than global |
🔍✨Examples
For records in the dataset where `text_length(text)` < 89.500 AND `text_length(text)` >= 80.500, the Precision is 15.79% lower than the global Precision.
Index | text | text_length(text) | label | Predicted label |
---|---|---|---|---|
115 | sam mendes has become valedictorian at the school for soft landings and easy ways out . | 88 | negative | positive (p = 0.95) |
142 | what better message than ` love thyself ' could young women of any size receive ? | 82 | positive | negative (p = 1.00) |
286 | at its best , queen is campy fun like the vincent price horror classics of the '60s . | 86 | positive | negative (p = 1.00) |
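The deviation column appears to be a relative difference between the metric on the slice and the metric on the whole dataset. Under that reading, a slice Precision of 0.719 that is 15.79% lower implies a global Precision of roughly 0.85. The sketch below shows the calculation on made-up records. The helper names, the toy data, and the choice of "positive" as the positive class are all assumptions for illustration.

```python
from typing import List, Tuple

Record = Tuple[str, str, str]  # (text, true label, predicted label)


def precision(records: List[Record], positive: str = "positive") -> float:
    """True positives divided by all records predicted as the positive class."""
    predicted_pos = [r for r in records if r[2] == positive]
    if not predicted_pos:
        return 0.0
    true_pos = sum(1 for r in predicted_pos if r[1] == positive)
    return true_pos / len(predicted_pos)


def in_length_slice(text: str, low: float = 80.5, high: float = 89.5) -> bool:
    """Slice condition from the report: low <= text_length(text) < high."""
    return low <= len(text) < high


def relative_deviation(slice_value: float, global_value: float) -> float:
    """Relative change of the slice metric with respect to the global metric."""
    return (slice_value - global_value) / global_value


# Made-up records; "x" * n only controls the text length.
records: List[Record] = [
    ("x" * 85, "negative", "positive"),   # in slice, false positive
    ("x" * 82, "positive", "positive"),   # in slice, true positive
    ("x" * 40, "positive", "positive"),   # outside slice, true positive
    ("x" * 120, "positive", "positive"),  # outside slice, true positive
    ("x" * 30, "negative", "negative"),   # outside slice, true negative
]

global_precision = precision(records)
slice_precision = precision([r for r in records if in_length_slice(r[0])])
deviation = relative_deviation(slice_precision, global_precision)
print(f"{deviation:+.2%}")
```

The same pattern applies to the Recall rows by swapping in a recall function.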
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_word_length(text) >= 5.511 | Recall = 0.844 | — | 6.81% lower than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` >= 5.511, the Recall is 6.81% lower than the global Recall.
Index | text | avg_word_length(text) | label | Predicted label |
---|---|---|---|---|
1 | unflinchingly bleak and desperate | 7.5 | negative | positive (p = 1.00) |
68 | good old-fashioned slash-and-hack is back ! | 6.33333 | positive | negative (p = 0.60) |
112 | hilariously inept and ridiculous . | 6 | positive | negative (p = 1.00) |
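The `avg_word_length(text)` values in this table are consistent with the mean length of whitespace-separated tokens, with punctuation tokens such as "." counted as words. The sketch below assumes that definition. It matches the values shown in this report but is not taken from the scanner's source.

```python
def avg_word_length(text: str) -> float:
    """Mean character length of whitespace-separated tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


# Rows from the report: expected 7.5, about 6.33333, and 6.
for sample in (
    "unflinchingly bleak and desperate",
    "good old-fashioned slash-and-hack is back !",
    "hilariously inept and ridiculous .",
):
    print(round(avg_word_length(sample), 5), sample)
```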
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_whitespace(text) < 0.154 | Recall = 0.844 | — | 6.81% lower than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` < 0.154, the Recall is 6.81% lower than the global Recall.
Index | text | avg_whitespace(text) | label | Predicted label |
---|---|---|---|---|
1 | unflinchingly bleak and desperate | 0.117647 | negative | positive (p = 1.00) |
68 | good old-fashioned slash-and-hack is back ! | 0.136364 | positive | negative (p = 0.60) |
112 | hilariously inept and ridiculous . | 0.142857 | positive | negative (p = 1.00) |
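The `avg_whitespace(text)` values above can be reproduced as the fraction of characters that are whitespace, provided each raw sentence ends with a trailing space that is not visible in the table. Both the definition and the trailing space are assumptions inferred from the numbers in this report.

```python
def avg_whitespace(text: str) -> float:
    """Fraction of characters in the text that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


# Assumption: the raw dataset rows carry a trailing space.
row_1 = "unflinchingly bleak and desperate "
row_112 = "hilariously inept and ridiculous . "
print(round(avg_whitespace(row_1), 6))
print(round(avg_whitespace(row_112), 6))
```

Because the whitespace fraction falls as the mean token length rises, slices on `avg_whitespace` tend to select the same rows as slices on `avg_word_length`, which may explain why identical examples and metric values appear in both tables.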
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_word_length(text) >= 4.354 AND avg_word_length(text) < 4.464 | Precision = 0.800 | — | 6.27% lower than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` >= 4.354 AND `avg_word_length(text)` < 4.464, the Precision is 6.27% lower than the global Precision.
Index | text | avg_word_length(text) | label | Predicted label |
---|---|---|---|---|
86 | the film flat lines when it should peak and is more missed opportunity and trifle than dark , decadent truffle . | 4.38095 | negative | positive (p = 0.93) |
147 | the talented and clever robert rodriguez perhaps put a little too much heart into his first film and did n't reserve enough for his second . | 4.42308 | negative | positive (p = 0.97) |
448 | something akin to a japanese alice through the looking glass , except that it seems to take itself far more seriously . | 4.45455 | positive | negative (p = 0.84) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_whitespace(text) < 0.187 AND avg_whitespace(text) >= 0.183 | Precision = 0.800 | — | 6.27% lower than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` < 0.187 AND `avg_whitespace(text)` >= 0.183, the Precision is 6.27% lower than the global Precision.
Index | text | avg_whitespace(text) | label | Predicted label |
---|---|---|---|---|
86 | the film flat lines when it should peak and is more missed opportunity and trifle than dark , decadent truffle . | 0.185841 | negative | positive (p = 0.93) |
147 | the talented and clever robert rodriguez perhaps put a little too much heart into his first film and did n't reserve enough for his second . | 0.184397 | negative | positive (p = 0.97) |
448 | something akin to a japanese alice through the looking glass , except that it seems to take itself far more seriously . | 0.183333 | positive | negative (p = 0.84) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | text_length(text) < 59.500 AND text_length(text) >= 50.500 | Precision = 0.800 | — | 6.27% lower than global |
🔍✨Examples
For records in the dataset where `text_length(text)` < 59.500 AND `text_length(text)` >= 50.500, the Precision is 6.27% lower than the global Precision.
Index | text | text_length(text) | label | Predicted label |
---|---|---|---|---|
139 | it 's not the ultimate depression-era gangster movie . | 55 | negative | positive (p = 0.98) |
183 | the lower your expectations , the more you 'll enjoy it . | 58 | negative | positive (p = 0.99) |
205 | falls neatly into the category of good stupid fun . | 52 | positive | negative (p = 0.92) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_word_length(text) >= 4.123 AND avg_word_length(text) < 4.209 | Recall = 0.850 | — | 6.12% lower than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` >= 4.123 AND `avg_word_length(text)` < 4.209, the Recall is 6.12% lower than the global Recall.
Index | text | avg_word_length(text) | label | Predicted label |
---|---|---|---|---|
113 | this movie is maddening . | 4.2 | negative | positive (p = 1.00) |
121 | it seems to me the film is about the art of ripping people off without ever letting them consciously know you have done so | 4.125 | negative | positive (p = 0.98) |
142 | what better message than ` love thyself ' could young women of any size receive ? | 4.125 | positive | negative (p = 1.00) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_whitespace(text) < 0.195 AND avg_whitespace(text) >= 0.192 | Recall = 0.850 | — | 6.12% lower than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` < 0.195 AND `avg_whitespace(text)` >= 0.192, the Recall is 6.12% lower than the global Recall.
Index | text | avg_whitespace(text) | label | Predicted label |
---|---|---|---|---|
113 | this movie is maddening . | 0.192308 | negative | positive (p = 1.00) |
121 | it seems to me the film is about the art of ripping people off without ever letting them consciously know you have done so | 0.195122 | negative | positive (p = 0.98) |
142 | what better message than ` love thyself ' could young women of any size receive ? | 0.195122 | positive | negative (p = 1.00) |
Check out the Giskard Space and test your model.
Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.