Text2Text Generation · Transformers · Safetensors · mt5 · Inference Endpoints
lmeribal committed on Commit 71ce5c0 (1 parent: 2f408cc)

Update README.md

Files changed (1): README.md (+71 −3)
---
license: cc-by-4.0
language:
- am
- ru
- en
- uk
- de
- ar
- zh
- es
- hi
datasets:
- s-nlp/ru_paradetox
- s-nlp/paradetox
- textdetox/multilingual_paradetox
library_name: transformers
pipeline_tag: text2text-generation
---

# mT0-XL-detox-orpo

**Resources**:

* [Paper]()
* [GitHub with training scripts and data](https://github.com/s-nlp/multilingual-transformer-detoxification)

## Model Information

This is a multilingual 3.7B-parameter text detoxification model developed for the [TextDetox 2024 shared task](https://pan.webis.de/clef24/pan24-web/text-detoxification.html) and based on [mT0-xl](https://huggingface.co/bigscience/mt0-xl). The model was trained in two steps: first, full fine-tuning on several parallel text detoxification datasets; second, ORPO alignment on a self-annotated preference dataset collected with toxicity and similarity classifiers. See the paper for more details.
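
The ORPO alignment step adds an odds-ratio preference penalty to the usual language-modeling loss, pushing the model to prefer the chosen (detoxified) rewrite over the rejected one. A minimal sketch of that penalty term in plain Python (the function name and the length-normalized log-probability inputs are our simplification, not the actual training code):

```python
import math

def orpo_odds_ratio_penalty(logp_chosen: float, logp_rejected: float) -> float:
    """Odds-ratio term of the ORPO loss for a single preference pair.

    Inputs are length-normalized sequence log-probabilities under the model.
    """
    p_c = math.exp(logp_chosen)    # probability of the preferred (detoxified) sequence
    p_r = math.exp(logp_rejected)  # probability of the dispreferred (toxic) sequence
    # log odds ratio: log(odds(chosen)) - log(odds(rejected)), with odds(p) = p / (1 - p)
    log_or = math.log(p_c / (1.0 - p_c)) - math.log(p_r / (1.0 - p_r))
    # -log(sigmoid(log_or)): shrinks as the model's preference for the chosen output grows
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))
```

During training this term is scaled by a weight and added to the cross-entropy loss on the chosen sequence, so fluency and preference are optimized jointly.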

## Example usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('s-nlp/mt0-xl-detox-orpo', device_map="auto")
tokenizer = AutoTokenizer.from_pretrained('s-nlp/mt0-xl-detox-orpo')

# Language-specific detoxification prompts prepended to the input text
LANG_PROMPTS = {
    'zh': '排毒:',
    'es': 'Desintoxicar: ',
    'ru': 'Детоксифицируй: ',
    'ar': 'إزالة السموم: ',
    'hi': 'विषहरण: ',
    'uk': 'Детоксифікуй: ',
    'de': 'Entgiften: ',
    'am': 'መርዝ መርዝ: ',
    'en': 'Detoxify: ',
}

def detoxify(text, lang, model, tokenizer):
    encodings = tokenizer(LANG_PROMPTS[lang] + text, return_tensors='pt').to(model.device)

    # Diverse beam search: 10 beams in 5 groups, returning 5 candidate rewrites
    outputs = model.generate(
        **encodings,
        max_length=128,
        num_beams=10,
        no_repeat_ngram_size=3,
        repetition_penalty=1.2,
        num_beam_groups=5,
        diversity_penalty=2.5,
        num_return_sequences=5,
        early_stopping=True,
    )

    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
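
Because `num_return_sequences=5`, `detoxify` returns five candidate rewrites in beam-score order, and diverse beam groups can still produce duplicates. A small helper to deduplicate while preserving that order (the helper is ours, not part of the model's API):

```python
def dedupe_candidates(candidates):
    # Keep the first occurrence of each rewrite, preserving beam-score order
    seen = set()
    unique = []
    for c in candidates:
        stripped = c.strip()
        if stripped not in seen:
            seen.add(stripped)
            unique.append(stripped)
    return unique
```

`dedupe_candidates(detoxify(...))[0]` then yields the top-ranked distinct detoxification.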

## Human evaluation


## Automatic evaluation