danlou committed

Commit 50c30f4 · verified · 1 Parent(s): 64e071b

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -82,13 +82,13 @@ More examples available in the [project's GitHub](https://github.com/danlou/rela

  ## Safety testing

- While this model is intended for research purposes, it's still relevant to explore how this conversational model (and its self-supervised approach) compare on safety risk against other conversational models trained on the same base LLM.
+ While this model is intended for research purposes, it's still relevant to explore how this conversational model (and its self-supervised approach) compares on safety risk against other conversational models trained on the same base LLM.

  This safety risk was evaluated by measuring refusals on sets of harmful questions compiled specifically for testing safety alignment of LLMs, namely [HarmfulQA](https://huggingface.co/datasets/declare-lab/HarmfulQA) and [CategoricalHarmfulQA](https://huggingface.co/datasets/declare-lab/CategoricalHarmfulQA).
  For comparison, we also evaluated [Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407), [dolphin-2.9.3-mistral-nemo-12](https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b) and [Mistral-Nemo-Instruct-2407-abliterated](https://huggingface.co/natong19/Mistral-Nemo-Instruct-2407-abliterated).
  Responses were generated by greedy search, with models loaded as bfloat16. Refusal responses were detected using [Llama-Guard-3-8B](https://huggingface.co/meta-llama/Llama-Guard-3-8B). See [here](https://github.com/danlou/relay/tree/main/safety) for evaluation code.

- As can be seen in plot below, Relay v0.1 refuses to answer the majority of these harmful questions, and more often than popular uncensored models trained on the same base LLM. Still, it does not refuse as frequently as Mistral's Instruct fine-tune (or other official LLMs), suggesting lower chance of false positives (harmfulness of several questions is not consensual).
+ As can be seen in the plot below, Relay v0.1 refuses to answer the majority of these harmful questions, and more often than popular uncensored models trained on the same base LLM. Still, it does not refuse as frequently as Mistral's Instruct fine-tune (or other official LLMs), suggesting lower chance of false positives (harmfulness of several questions is not consensual).

  <img src="https://cdn-uploads.huggingface.co/production/uploads/60f808c5c1adf9100f1f263c/0m-dMagE7yKy1V-EB-fJ3.png" width="800"/>

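For reference, the sketch below illustrates the kind of refusal evaluation the hunk above describes: greedy (non-sampled) decoding from a model loaded in bfloat16, followed by a Llama-Guard-3-8B pass over each answer. It is not the repository's evaluation code (see the linked `safety` directory for that), and several details are assumptions for illustration: the checkpoint used as the model under test, the HarmfulQA `question` column name, and the heuristic of counting answers that Llama-Guard does not flag as unsafe as refusal-like.

```python
# Hedged sketch of the evaluation described in the README hunk above.
# Not the repo's actual code; see https://github.com/danlou/relay/tree/main/safety.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Model under test, loaded as bfloat16 (any of the compared checkpoints could go here; assumption).
model_id = "mistralai/Mistral-Nemo-Instruct-2407"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

# Judge model used to flag responses.
guard_id = "meta-llama/Llama-Guard-3-8B"
guard_tok = AutoTokenizer.from_pretrained(guard_id)
guard = AutoModelForCausalLM.from_pretrained(guard_id, torch_dtype=torch.bfloat16).to(device)

def generate_reply(question: str) -> str:
    """Greedy-search response from the model under test."""
    inputs = tok.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(device)
    out = model.generate(inputs, max_new_tokens=512, do_sample=False)
    return tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)

def guard_verdict(question: str, answer: str) -> str:
    """Llama-Guard-3-8B verdict ('safe' or 'unsafe ...') for the model's answer."""
    chat = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
    inputs = guard_tok.apply_chat_template(chat, return_tensors="pt").to(device)
    out = guard.generate(inputs, max_new_tokens=20, do_sample=False)
    return guard_tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True).strip()

# HarmfulQA questions; the 'question' field name is an assumption.
# CategoricalHarmfulQA would be handled the same way.
questions = load_dataset("declare-lab/HarmfulQA", split="train")["question"]

# Illustrative tally: count answers that Llama-Guard does NOT flag as unsafe,
# used here as a rough proxy for refusal / safe-completion rate.
refusal_like = 0
for q in questions:
    verdict = guard_verdict(q, generate_reply(q))
    refusal_like += int(not verdict.startswith("unsafe"))
print(f"refusal-like (safe) responses: {refusal_like}/{len(questions)}")
```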