Update README.md
README.md CHANGED
@@ -247,6 +247,20 @@ Developers should apply responsible AI best practices, including mapping, measur
* Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
* Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.

+## Safety Evaluation and Red-Teaming
+
+We leveraged various evaluation techniques, including red teaming, adversarial conversation simulations, and multilingual safety evaluation benchmark datasets, to
+evaluate the Phi-3.5 models' propensity to produce undesirable outputs across multiple languages and risk categories.
+Several approaches were used to compensate for the limitations of any single approach. Findings across the various evaluation methods indicate that safety
+post-training, done as detailed in the [Phi-3 Safety Post-Training paper](https://arxiv.org/pdf/2407.13833), had a positive impact across multiple languages and risk categories, as observed by
+refusal rates (refusal to produce undesirable outputs) and robustness to jailbreak techniques. Note, however, that while comprehensive red team evaluations were conducted
+across all models in the prior release of Phi models, red teaming for this release largely focused on Phi-3.5 MOE across multiple languages and risk categories, as
+it is the largest and most capable of the three models. Details on prior red team evaluations across Phi models can be found in the [Phi-3 Safety Post-Training paper](https://arxiv.org/pdf/2407.13833).
+For this release, insights from red teaming indicate that the models may refuse to generate undesirable outputs in English, even when the request for undesirable output
+is in another language. Models may also be more susceptible to longer multi-turn jailbreak techniques across both English and non-English languages. These findings
+highlight the need for industry-wide investment in the development of high-quality safety evaluation datasets across multiple languages, including low-resource languages,
+and risk areas that account for cultural nuances where those languages are spoken.
+
## Software
* [PyTorch](https://github.com/pytorch/pytorch)
* [Transformers](https://github.com/huggingface/transformers)
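
As a usage note for the software listed above, here is a minimal sketch of loading and prompting a Phi-3.5 model with PyTorch and Transformers; the model ID, precision, and generation settings are assumptions for illustration and are not part of this change:

```python
# Minimal sketch of loading a Phi-3.5 model with the listed software stack.
# The model ID, dtype, and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-MoE-instruct"  # assumed Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed precision; adjust for your hardware
    device_map="auto",
    trust_remote_code=True,
)

# Build a chat-formatted prompt and generate a short completion.
messages = [{"role": "user", "content": "Briefly explain what red teaming is."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```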