cyber-chris committed
Commit d9a22af · 1 Parent(s): ffb335a

update README

Files changed (1): README.md +7 -2
README.md CHANGED
@@ -66,8 +66,13 @@ There is also a secondary goal of ensuring that the outputs remain high quality,
 
 ### Deception Detection "Classifier" Metrics
 
-The best accuracy for a threshold on the simple classification problem was `0.75`.
+I use two small datasets to quickly evaluate my approach:
+
+- "Simple Dataset": composed of a subset of Stanford's Alpaca dataset and JailbreakBench's dataset. This represents the "easy" classification problem of distinguishing obviously harmless prompts from obviously harmful prompts.
+- "Red-team Dataset": composed of a subset of JailbreakBench's dataset. The benign prompts are less obviously harmless, posing a more challenging problem: distinguishing which prompts should actually be refused.
+
+The best accuracy over the threshold settings on the simple classification problem was `0.75`.
 ![output (1)](https://github.com/user-attachments/assets/72d2739b-88d4-4cf5-9de4-31c2d043d8ba)
 
-The best accuracy for a threshold on the red-team dataset was `0.65`.
+The best accuracy over the threshold settings on the red-team dataset was `0.65`.
 ![output (2)](https://github.com/user-attachments/assets/deadc28f-6729-4a4d-a5b9-60378e6ea7f8)
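
The added README lines report the best accuracy found by sweeping a detection threshold on each dataset. A minimal sketch of that kind of sweep is below; it is not the repository's actual evaluation code, and the function name, score/label arrays, and threshold grid are hypothetical placeholders.

```python
# Minimal sketch (assumed, not the repo's code) of picking the best accuracy
# over a range of detection thresholds, as described in the README change.
import numpy as np

def best_threshold_accuracy(scores: np.ndarray, labels: np.ndarray, num_steps: int = 101):
    """Sweep candidate thresholds and return (best_threshold, best_accuracy).

    scores: detection scores, higher = more likely harmful/deceptive.
    labels: 1 for prompts that should be refused, 0 for benign prompts.
    """
    thresholds = np.linspace(scores.min(), scores.max(), num_steps)
    best_t, best_acc = thresholds[0], 0.0
    for t in thresholds:
        preds = (scores >= t).astype(int)      # refuse if the score crosses the threshold
        acc = float((preds == labels).mean())  # plain accuracy on this split
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical usage on the "Simple Dataset" split: scores/labels would come
# from the detector's outputs on the Alpaca + JailbreakBench subset above.
# t, acc = best_threshold_accuracy(simple_scores, simple_labels)
# print(f"best accuracy {acc:.2f} at threshold {t:.3f}")
```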