etomoscow committed
Commit 7a7ebe4 (1 parent: 7a46731)

Create README.md


## **Model Overview**

This is the model presented in the ACL-IJCNLP 2023 paper ["Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification"](https://aclanthology.org/2022.acl-long.469/).

The model itself is an [`mBART-large-50`](https://huggingface.co/facebook/mbart-large-50) model fine-tuned on the parallel detoxification datasets [ParaDetox](https://huggingface.co/datasets/s-nlp/paradetox) and [RuDetox](https://github.com/s-nlp/russe_detox_2022), so it performs detoxification for both **Russian** and **English**. More details are presented in the paper.

## **How to use**

1. Load the model checkpoint.

```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("s-nlp/mBART_EN_RU")
model = MBartForConditionalGeneration.from_pretrained("s-nlp/mBART_EN_RU")
```
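
Optionally, move the model to a GPU if one is available; this is standard PyTorch usage and not specific to this checkpoint:

```python
import torch

# Use a GPU when available; the helper below picks up the device via `model.device`
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
```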

2. Define a helper function.
```python
def paraphrase(text, model, tokenizer, n=None, max_length="auto", beams=5):
    # Accept a single string or a list of strings
    texts = [text] if isinstance(text, str) else text
    inputs = tokenizer(texts, return_tensors="pt", padding=True)["input_ids"].to(
        model.device
    )

    # By default, allow outputs slightly longer than the input
    if max_length == "auto":
        max_length = inputs.shape[1] + 10

    result = model.generate(
        inputs,
        num_return_sequences=n or 1,
        do_sample=False,
        temperature=1.0,
        repetition_penalty=10.0,
        max_length=max_length,
        min_length=int(0.5 * max_length),
        num_beams=beams,
        # mBART expects the target-language code as the first generated token
        forced_bos_token_id=tokenizer.lang_code_to_id[tokenizer.tgt_lang],
    )
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in result]

    # Return a single string for single-string input, otherwise the list of outputs
    if not n and isinstance(text, str):
        return texts[0]
    return texts
```
3. Generate detoxified text.
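
A minimal sketch of this step, assuming the checkpoint uses the standard mBART-50 tokenizer (language codes `en_XX` and `ru_RU`, settable `src_lang`/`tgt_lang` attributes); the input sentences are made-up illustrations:

```python
# English detoxification: source and target language are both English
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "en_XX"
print(paraphrase("this is a stupid, useless idea", model, tokenizer))

# Russian detoxification
tokenizer.src_lang = "ru_RU"
tokenizer.tgt_lang = "ru_RU"
print(paraphrase("что за дурацкая идея", model, tokenizer))
```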

**Citation**
```
TBD
```

Files changed (1): README.md (+9, -0)

README.md ADDED

@@ -0,0 +1,9 @@
+ ---
+ datasets:
+ - s-nlp/paradetox
+ language:
+ - ru
+ - en
+ library_name: transformers
+ pipeline_tag: text2text-generation
+ ---