etomoscow committed on
Commit
ea1823e
1 Parent(s): 52f1846

Update README.md

  1. README.md +53 -1
README.md CHANGED
@@ -6,4 +6,56 @@ language:
  - en
  library_name: transformers
  pipeline_tag: text2text-generation
- ---
+ ---
+
+ ## Model Description
+
+ This is the model presented in the paper "Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification".
+
+ The model is based on [mBART-large-50](https://huggingface.co/facebook/mbart-large-50) and trained on two parallel detoxification corpora: [ParaDetox](https://huggingface.co/datasets/s-nlp/paradetox) (English) and [RuDetox](https://github.com/s-nlp/russe_detox_2022/tree/main/data) (Russian). See the paper for further details.
+
+
+ ## Usage
+
+ 1. Model loading.
+ ```python
+ from transformers import MBartForConditionalGeneration, AutoTokenizer
+
+ # .cuda() moves the model to the GPU and requires a CUDA-capable device.
+ model = MBartForConditionalGeneration.from_pretrained("s-nlp/mBART_EN_RU").cuda()
+ # The tokenizer is loaded from the base mBART-large-50 checkpoint.
+ tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50")
+ ```
+
+ 2. Detoxification utility.
+ ```python
+ def paraphrase(text, model, tokenizer, n=None, max_length="auto", beams=3):
+     # Accept either a single string or a list of strings.
+     texts = [text] if isinstance(text, str) else text
+     inputs = tokenizer(texts, return_tensors="pt", padding=True)["input_ids"].to(
+         model.device
+     )
+     # "auto" allows up to 10 tokens more than the padded input length.
+     if max_length == "auto":
+         max_length = inputs.shape[1] + 10
+
+     result = model.generate(
+         inputs,
+         num_return_sequences=n or 1,
+         do_sample=True,
+         temperature=1.0,
+         repetition_penalty=10.0,
+         max_length=max_length,
+         min_length=int(0.5 * max_length),
+         num_beams=beams,
+         # Requires tokenizer.tgt_lang to be set (e.g. "en_XX" or "ru_RU");
+         # the lookup raises KeyError while it is unset.
+         forced_bos_token_id=tokenizer.lang_code_to_id[tokenizer.tgt_lang],
+     )
+     texts = [tokenizer.decode(r, skip_special_tokens=True) for r in result]
+
+     # A single string with no explicit n returns a string; otherwise a list.
+     if not n and isinstance(text, str):
+         return texts[0]
+     return texts
+ ```
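To make the length settings in `paraphrase` concrete, the sketch below reproduces only its length arithmetic: with `max_length="auto"`, the cap is the padded input length plus 10 tokens, and `min_length` is half of that cap, rounded down. The function name `generation_length_bounds` and its `slack` and `min_ratio` parameters are illustrative and not part of the model card.

```python
def generation_length_bounds(input_len, slack=10, min_ratio=0.5):
    """Return (min_length, max_length) as computed inside `paraphrase`.

    `input_len` stands for inputs.shape[1], the padded token length of the batch.
    """
    max_length = input_len + slack            # the max_length="auto" branch
    min_length = int(min_ratio * max_length)  # int() truncates toward zero
    return min_length, max_length


# For a 12-token input, generate() receives min_length=11 and max_length=22.
print(generation_length_bounds(12))
```

Because `min_length` is tied to the input length, very short outputs are ruled out, which may matter when detoxification would naturally shorten a sentence. Passing an explicit integer `max_length` changes both bounds.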
+
+
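The `forced_bos_token_id` lookup in `paraphrase` depends on `tokenizer.tgt_lang`, so the tokenizer's languages need to be set before calling it. A minimal sketch, assuming the output should stay in the same language as the input and using the mBART-50 language codes `en_XX` and `ru_RU`; the helper `detoxify` and its `paraphrase_fn` parameter are hypothetical names, not part of the model card.

```python
def detoxify(text, lang_code, model, tokenizer, paraphrase_fn):
    """Set the tokenizer languages, then delegate to a paraphrase function.

    Assumption: input and output share one language, e.g. "en_XX" or "ru_RU".
    `paraphrase_fn` is expected to have the signature of `paraphrase` above.
    """
    tokenizer.src_lang = lang_code  # language tag used when encoding the input
    tokenizer.tgt_lang = lang_code  # read by paraphrase for forced_bos_token_id
    return paraphrase_fn(text, model, tokenizer)


# With the model and tokenizer loaded as in step 1:
# detoxify("your input sentence", "en_XX", model, tokenizer, paraphrase)
```

Passing the paraphrase function as an argument keeps the helper independent of how the model is loaded. If a cross-lingual setup is intended, the two language attributes would be set to different codes instead.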
+ ## Citation
+
+
+ TBD