danlou committed

Commit 50c30f4 · verified · 1 Parent(s): 64e071b

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -82,13 +82,13 @@ More examples available in the [project's GitHub](https://github.com/danlou/rela

  ## Safety testing

- While this model is intended for research purposes, it's still relevant to explore how this conversational model (and its self-supervised approach) compare on safety risk against other conversational models trained on the same base LLM.
+ While this model is intended for research purposes, it's still relevant to explore how this conversational model (and its self-supervised approach) compares on safety risk against other conversational models trained on the same base LLM.

  This safety risk was evaluated by measuring refusals on sets of harmful questions compiled specifically for testing safety alignment of LLMs, namely [HarmfulQA](https://huggingface.co/datasets/declare-lab/HarmfulQA) and [CategoricalHarmfulQA](https://huggingface.co/datasets/declare-lab/CategoricalHarmfulQA).
  For comparison, we also evaluated [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407), [dolphin-2.9.3-mistral-nemo-12](https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b) and [Mistral-Nemo-Instruct-2407-abliterated](https://huggingface.co/natong19/Mistral-Nemo-Instruct-2407-abliterated).
  Responses were generated by greedy search, with models loaded as bfloat16. Refusal responses were detected using [Llama-Guard-3-8B](https://huggingface.co/meta-llama/Llama-Guard-3-8B). See [here](https://github.com/danlou/relay/tree/main/safety) for evaluation code.

- As can be seen in plot below, Relay v0.1 refuses to answer the majority of these harmful questions, and more often than popular uncensored models trained on the same base LLM. Still, it does not refuse as frequently as Mistral's Instruct fine-tune (or other official LLMs), suggesting lower chance of false positives (harmfulness of several questions is not consensual).
+ As can be seen in the plot below, Relay v0.1 refuses to answer the majority of these harmful questions, and more often than popular uncensored models trained on the same base LLM. Still, it does not refuse as frequently as Mistral's Instruct fine-tune (or other official LLMs), suggesting lower chance of false positives (harmfulness of several questions is not consensual).

  <img src="https://cdn-uploads.huggingface.co/production/uploads/60f808c5c1adf9100f1f263c/0m-dMagE7yKy1V-EB-fJ3.png" width="800"/>

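For reference, the sketch below illustrates the kind of refusal evaluation the hunk above describes: greedy (non-sampled) decoding from a model loaded in bfloat16, followed by a Llama-Guard-3-8B pass over each answer. It is not the repository's evaluation code (see the linked `safety` directory for that), and several details are assumptions for illustration: the checkpoint used as the model under test, the HarmfulQA `question` column name, and the heuristic of counting answers that Llama-Guard does not flag as unsafe as refusal-like.

```python
# Hedged sketch of the evaluation described in the README hunk above.
# Not the repo's actual code; see https://github.com/danlou/relay/tree/main/safety.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Model under test, loaded as bfloat16 (any of the compared checkpoints could go here; assumption).
model_id = "mistralai/Mistral-Nemo-Instruct-2407"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

# Judge model used to flag responses.
guard_id = "meta-llama/Llama-Guard-3-8B"
guard_tok = AutoTokenizer.from_pretrained(guard_id)
guard = AutoModelForCausalLM.from_pretrained(guard_id, torch_dtype=torch.bfloat16).to(device)

def generate_reply(question: str) -> str:
    """Greedy-search response from the model under test."""
    inputs = tok.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(device)
    out = model.generate(inputs, max_new_tokens=512, do_sample=False)
    return tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)

def guard_verdict(question: str, answer: str) -> str:
    """Llama-Guard-3-8B verdict ('safe' or 'unsafe ...') for the model's answer."""
    chat = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
    inputs = guard_tok.apply_chat_template(chat, return_tensors="pt").to(device)
    out = guard.generate(inputs, max_new_tokens=20, do_sample=False)
    return guard_tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True).strip()

# HarmfulQA questions; the 'question' field name is an assumption.
# CategoricalHarmfulQA would be handled the same way.
questions = load_dataset("declare-lab/HarmfulQA", split="train")["question"]

# Illustrative tally: count answers that Llama-Guard does NOT flag as unsafe,
# used here as a rough proxy for refusal / safe-completion rate.
refusal_like = 0
for q in questions:
    verdict = guard_verdict(q, generate_reply(q))
    refusal_like += int(not verdict.startswith("unsafe"))
print(f"refusal-like (safe) responses: {refusal_like}/{len(questions)}")
```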