Commit d9a22af
1 Parent(s): ffb335a
update README
README.md CHANGED
@@ -66,8 +66,13 @@ There is also a secondary goal of ensuring that the outputs remain high quality,
 
 ### Deception Detection "Classifier" Metrics
 
-
+I use two small datasets to quickly evaluate my approach:
+
+- "Simple Dataset": composed of a subset of Stanford's Alpaca dataset and JailbreakBench's dataset. This represents the "easy" classification problem of distinguishing obviously harmless prompts from obviously harmful prompts.
+- "Red-team Dataset": composed of a subset of JailbreakBench's dataset. The benign prompts are less obviously harmless, and pose a more challenging problem to distinguish which prompts should actually be refused.
+
+The best accuracy over the threshold settings on the simple classification problem was `0.75`.
 
 
-The best accuracy
+The best accuracy over the threshold settings on the red-team dataset was `0.65`.
 
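The README reports a "best accuracy over the threshold settings" but does not show how it is computed. One plausible reading is a threshold sweep: score each prompt, classify it as "should be refused" when the score reaches a threshold, and keep the threshold with the highest accuracy. The sketch below illustrates that reading only. The function names, the `score >= threshold` convention, and the toy scores and labels are all assumptions made for illustration, not the author's code or data.

```python
from typing import Sequence, Tuple


def accuracy_at_threshold(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> float:
    """Accuracy when a prompt is flagged (label 1) if its score >= threshold.

    Assumption: higher scores mean "more likely should be refused".
    """
    predictions = [score >= threshold for score in scores]
    correct = sum(pred == bool(label) for pred, label in zip(predictions, labels))
    return correct / len(labels)


def best_accuracy(
    scores: Sequence[float], labels: Sequence[int], thresholds: Sequence[float]
) -> Tuple[float, float]:
    """Return (best accuracy, threshold that achieved it).

    On ties, the first threshold in `thresholds` that reached the best accuracy wins.
    """
    best_acc, best_thr = -1.0, thresholds[0]
    for thr in thresholds:
        acc = accuracy_at_threshold(scores, labels, thr)
        if acc > best_acc:
            best_acc, best_thr = acc, thr
    return best_acc, best_thr


# Toy data, purely hypothetical: two benign prompts (0) and two harmful prompts (1).
toy_scores = [0.1, 0.2, 0.7, 0.9]
toy_labels = [0, 0, 1, 1]
toy_thresholds = [0.0, 0.5, 1.0]

acc, thr = best_accuracy(toy_scores, toy_labels, toy_thresholds)
print(f"best accuracy {acc:.2f} at threshold {thr}")
```

Note that a best accuracy picked this way is selected on the same data it is measured on, so on small datasets it tends to be an optimistic estimate.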