---
license: cc-by-4.0
language:
- am
- ru
- en
- uk
- de
- ar
- zh
- es
- hi
datasets:
- s-nlp/ru_paradetox
- s-nlp/paradetox
- textdetox/multilingual_paradetox
library_name: transformers
pipeline_tag: text2text-generation
---
# mT0-XL-detox-orpo
**Resources**:
* [Paper](https://arxiv.org/abs/2407.05449)
* [GitHub with training scripts and data](https://github.com/s-nlp/multilingual-transformer-detoxification)
## Model Information
This is a multilingual 3.7B-parameter text detoxification model covering nine languages, built for the [TextDetox 2024 shared task](https://pan.webis.de/clef24/pan24-web/text-detoxification.html) on top of [mT0-XL](https://huggingface.co/bigscience/mt0-xl). The model was trained in two steps: first, full fine-tuning on several parallel text detoxification datasets; second, ORPO alignment on a self-annotated preference dataset collected with toxicity and similarity classifiers. See the paper for details.
In human evaluation, the model ranks second overall in the [TextDetox 2024 shared task](https://pan.webis.de/clef24/pan24-web/text-detoxification.html). More precisely, it achieves state-of-the-art performance for Ukrainian, second-best scores for Arabic, and near state-of-the-art performance for the other languages.
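The alignment step operates on preference pairs: for each toxic input, a detoxified rewrite ranked higher by the toxicity and similarity classifiers is preferred over a lower-ranked one. Below is a minimal sketch of such a step using TRL's `ORPOTrainer`; it is an illustration under assumptions, not the authors' exact training script (that lives in the linked GitHub repository), and the placeholder texts and `beta` value are hypothetical.
```python
# Minimal ORPO alignment sketch, assuming TRL's ORPOTrainer; not the authors'
# exact script (see the linked GitHub repository for their training code).
from datasets import Dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-xl")
tokenizer = AutoTokenizer.from_pretrained("bigscience/mt0-xl")

# Preference pairs in TRL's prompt/chosen/rejected format: "chosen" is the
# rewrite ranked higher by the toxicity and similarity classifiers.
train_dataset = Dataset.from_list([
    {
        "prompt": "Detoxify: <toxic sentence>",          # placeholder text
        "chosen": "<higher-ranked detoxified rewrite>",  # placeholder text
        "rejected": "<lower-ranked rewrite>",            # placeholder text
    },
])

trainer = ORPOTrainer(
    model=model,
    args=ORPOConfig(output_dir="mt0-xl-detox-orpo", beta=0.1),  # beta is illustrative
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # processing_class= in newer TRL releases
)
trainer.train()
```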
## Example usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained('s-nlp/mt0-xl-detox-orpo', device_map="auto")
tokenizer = AutoTokenizer.from_pretrained('s-nlp/mt0-xl-detox-orpo')
LANG_PROMPTS = {
    'zh': '排毒:',
    'es': 'Desintoxicar: ',
    'ru': 'Детоксифицируй: ',
    'ar': 'إزالة السموم: ',
    'hi': 'विषहरण: ',
    'uk': 'Детоксифікуй: ',
    'de': 'Entgiften: ',
    'am': 'መርዝ መርዝ: ',
    'en': 'Detoxify: ',
}
def detoxify(text, lang, model, tokenizer):
    # Prepend the language-specific prompt and move the inputs to the model's device.
    encodings = tokenizer(LANG_PROMPTS[lang] + text, return_tensors='pt').to(model.device)

    # Diverse beam search: 10 beams split into 5 groups, returning 5 candidate rewrites.
    outputs = model.generate(
        **encodings,
        max_length=128,
        num_beams=10,
        no_repeat_ngram_size=3,
        repetition_penalty=1.2,
        num_beam_groups=5,
        diversity_penalty=2.5,
        num_return_sequences=5,
        early_stopping=True,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
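A quick usage sketch (the input sentence is illustrative): `detoxify` returns five candidates ordered by beam score, so the first is typically the best.
```python
candidates = detoxify("You are such an idiot!", "en", model, tokenizer)
print(candidates[0])  # highest-scoring detoxified rewrite
```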
## Citation
```
@inproceedings{smurfcat_at_pan,
author = {Elisei Rykov and
Konstantin Zaytsev and
Ivan Anisimov and
Alexandr Voronin},
editor = {Guglielmo Faggioli and
Nicola Ferro and
Petra Galusc{\'{a}}kov{\'{a}} and
Alba Garc{\'{\i}}a Seco de Herrera},
title = {SmurfCat at {PAN} 2024 TextDetox: Alignment of Multilingual Transformers
for Text Detoxification},
booktitle = {Working Notes of the Conference and Labs of the Evaluation Forum {(CLEF}
2024), Grenoble, France, 9-12 September, 2024},
series = {{CEUR} Workshop Proceedings},
volume = {3740},
pages = {2866--2871},
publisher = {CEUR-WS.org},
year = {2024},
url = {https://ceur-ws.org/Vol-3740/paper-276.pdf},
timestamp = {Wed, 21 Aug 2024 22:46:00 +0200},
biburl = {https://dblp.org/rec/conf/clef/RykovZAV24.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```