Update README.md
README.md CHANGED
@@ -247,6 +247,20 @@ Developers should apply responsible AI best practices, including mapping, measur
* Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
* Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.

+## Safety Evaluation and Red-Teaming
+
+We leveraged various evaluation techniques, including red teaming, adversarial conversation simulations, and multilingual safety evaluation benchmark datasets, to
+evaluate the Phi-3.5 models' propensity to produce undesirable outputs across multiple languages and risk categories.
+Several approaches were used to compensate for the limitations of any single approach. Findings across the various evaluation methods indicate that safety
+post-training, done as detailed in the [Phi-3 Safety Post-Training paper](https://arxiv.org/pdf/2407.13833), had a positive impact across multiple languages and risk categories, as observed by
+refusal rates (refusal to produce undesirable outputs) and robustness to jailbreak techniques. Note, however, that while comprehensive red team evaluations were conducted
+across all models in the prior release of Phi models, red teaming for this release largely focused on Phi-3.5 MOE across multiple languages and risk categories, as
+it is the largest and most capable of the three models. Details on prior red team evaluations across Phi models can be found in the [Phi-3 Safety Post-Training paper](https://arxiv.org/pdf/2407.13833).
+For this release, insights from red teaming indicate that the models may refuse to generate undesirable outputs in English, even when the request for undesirable output
+is in another language. Models may also be more susceptible to longer multi-turn jailbreak techniques across both English and non-English languages. These findings
+highlight the need for industry-wide investment in the development of high-quality safety evaluation datasets across multiple languages, including low-resource languages,
+and risk areas that account for cultural nuances where those languages are spoken.
+
## Software
* [PyTorch](https://github.com/pytorch/pytorch)
* [Transformers](https://github.com/huggingface/transformers)
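
As a usage note for the software listed above, here is a minimal sketch of loading and prompting a Phi-3.5 model with PyTorch and Transformers; the model ID, precision, and generation settings are assumptions for illustration and are not part of this change:

```python
# Minimal sketch of loading a Phi-3.5 model with the listed software stack.
# The model ID, dtype, and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-MoE-instruct"  # assumed Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed precision; adjust for your hardware
    device_map="auto",
    trust_remote_code=True,
)

# Build a chat-formatted prompt and generate a short completion.
messages = [{"role": "user", "content": "Briefly explain what red teaming is."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```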