---
license: cc-by-4.0
language:
- am
- ru
- en
- uk
- de
- ar
- zh
- es
- hi
datasets:
- s-nlp/ru_paradetox
- s-nlp/paradetox
- textdetox/multilingual_paradetox
library_name: transformers
pipeline_tag: text2text-generation
---

# mT0-XL-detox-orpo

**Resources**:
* [Paper](https://arxiv.org/abs/2407.05449)
* [GitHub with training scripts and data](https://github.com/s-nlp/multilingual-transformer-detoxification)

## Model Information

This is a multilingual 3.7B text detoxification model for 9 languages, built for the [TextDetox 2024 shared task](https://pan.webis.de/clef24/pan24-web/text-detoxification.html) and based on [mT0-XL](https://huggingface.co/bigscience/mt0-xl). The model was trained in a two-step setup: the first step is full fine-tuning on several parallel text detoxification datasets, and the second step is ORPO alignment on a self-annotated preference dataset collected using toxicity and similarity classifiers. See the paper for more details.

In terms of human evaluation, the model is the second-best approach on the [TextDetox 2024 shared task](https://pan.webis.de/clef24/pan24-web/text-detoxification.html). More precisely, it achieves state-of-the-art performance for Ukrainian, top-2 scores for Arabic, and near state-of-the-art performance for the other languages.

## Example usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('s-nlp/mt0-xl-detox-orpo', device_map="auto")
tokenizer = AutoTokenizer.from_pretrained('s-nlp/mt0-xl-detox-orpo')

# Language-specific detoxification prompts prepended to the input text
LANG_PROMPTS = {
    'zh': '排毒:',
    'es': 'Desintoxicar: ',
    'ru': 'Детоксифицируй: ',
    'ar': 'إزالة السموم: ',
    'hi': 'विषहरण: ',
    'uk': 'Детоксифікуй: ',
    'de': 'Entgiften: ',
    'am': 'መርዝ መርዝ: ',
    'en': 'Detoxify: ',
}

def detoxify(text, lang, model, tokenizer):
    # Prepend the language-specific prompt and move inputs to the model's device
    encodings = tokenizer(LANG_PROMPTS[lang] + text, return_tensors='pt').to(model.device)

    # Diverse beam search: 10 beams in 5 groups, returning 5 candidate rewrites
    outputs = model.generate(
        **encodings,
        max_length=128,
        num_beams=10,
        no_repeat_ngram_size=3,
        repetition_penalty=1.2,
        num_beam_groups=5,
        diversity_penalty=2.5,
        num_return_sequences=5,
        early_stopping=True,
    )

    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

## Citation

```
@inproceedings{smurfcat_at_pan,
  author    = {Elisei Rykov and Konstantin Zaytsev and Ivan Anisimov and Alexandr Voronin},
  editor    = {Guglielmo Faggioli and Nicola Ferro and Petra Galusc{\'{a}}kov{\'{a}} and Alba Garc{\'{\i}}a Seco de Herrera},
  title     = {SmurfCat at {PAN} 2024 TextDetox: Alignment of Multilingual Transformers for Text Detoxification},
  booktitle = {Working Notes of the Conference and Labs of the Evaluation Forum {(CLEF} 2024), Grenoble, France, 9-12 September, 2024},
  series    = {{CEUR} Workshop Proceedings},
  volume    = {3740},
  pages     = {2866--2871},
  publisher = {CEUR-WS.org},
  year      = {2024},
  url       = {https://ceur-ws.org/Vol-3740/paper-276.pdf},
  timestamp = {Wed, 21 Aug 2024 22:46:00 +0200},
  biburl    = {https://dblp.org/rec/conf/clef/RykovZAV24.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
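
As a follow-up to the `detoxify` helper defined in the Example usage section above, the snippet below is a minimal, illustrative sketch of how it might be called; the sample sentence and the printing loop are assumptions added for demonstration and are not part of the original card.

```python
# Minimal usage sketch for the detoxify() helper defined above.
# The input sentence is an illustrative placeholder, not from the original card.
candidates = detoxify("You are such an idiot!", "en", model, tokenizer)

# num_return_sequences=5, so five detoxified candidates are returned.
for i, candidate in enumerate(candidates, start=1):
    print(f"{i}: {candidate}")
```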