---
license: cc-by-nc-sa-4.0
datasets:
- wi_locness
- matejklemen/falko_merlin
- paws
- paws-x
- asset
language:
- en
- de
- es
- ar
- ja
- ko
- zh
metrics:
- bleu
- rouge
- sari
- accuracy
library_name: transformers
widget:
- text: >-
    Umschreiben Sie den Satz: When I grow up, I start to understand what he said
    is quite right.
  example_title: GEC (de|en)
- text: >-
    문장의 간단한 버전 작성: Cuando se pueden mantener tasas de flujo comparables, los
    resultados son altos.
  example_title: Simplification (ko|es)
- text: 'Paraphrase this: いちごは物語を紹介し、読者をイベントに導くと彼は言った。'
  example_title: Paraphrase (en|ja)
pipeline_tag: text2text-generation
---

# Model Card for mEdIT-xl

The `medit-xl` model was obtained by fine-tuning the `MBZUAI/bactrian-x-llama-7b-lora` model on the mEdIT dataset.

**Paper:** mEdIT: Multilingual Text Editing via Instruction Tuning

**Authors:** Vipul Raheja, Dimitris Alikaniotis, Vivek Kulkarni, Bashar Alhafni, Dhruv Kumar

## Model Details

### Model Description

- **Language(s) (NLP)**: Arabic, Chinese, English, German, Japanese, Korean, Spanish
- **Finetuned from model:** `MBZUAI/bactrian-x-llama-7b-lora`

### Model Sources

- **Repository:** https://github.com/vipulraheja/medit
- **Paper:** https://arxiv.org/abs/2402.16472v1

## How to use

Given an edit instruction and an original text, our model generates the edited version of the text.

![task_specs](https://cdn-uploads.huggingface.co/production/uploads/60985a0547dc3dbf8a976607/816ZY2t0XPCpMMd6Z072K.png)

Specifically, our models support both multilingual and cross-lingual text revision. The input and output texts are always in the same language. The setting is monolingual when the edit instruction is written in the same language as the input text, and cross-lingual when the instruction is written in a different language.
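
The distinction can be sketched as a small helper. This is only an illustration: the function name `revision_setting` and the use of the two-letter codes from this card's `language` metadata are assumptions, not part of the released code.

```python
# Hypothetical helper, not part of the mEdIT codebase.
def revision_setting(instruction_lang: str, input_lang: str) -> str:
    """Compare the instruction language with the input-text language.

    The output language is not a parameter because the output is always
    in the same language as the input.
    """
    same = instruction_lang.strip().lower() == input_lang.strip().lower()
    return "monolingual" if same else "cross-lingual"


print(revision_setting("en", "en"))  # monolingual
print(revision_setting("ja", "en"))  # cross-lingual
```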

### Instruction format

Follow the instruction format below; prompts that deviate from it may produce lower-quality output.

```
instruction_tokens = [
    "Instruction",
    "Anweisung",
    ...
]

input_tokens = [
    "Input",
    "Aporte",
    ...
]

output_tokens = [
    "Output",
    "Produzione",
    ...
]

task_descriptions = [
    "Fix grammatical errors in this sentence",  # <-- GEC task
    "Umschreiben Sie den Satz",                 # <-- Paraphrasing
    ...
]
```

**The entire list of possible instructions, input/output tokens, and task descriptions can be found in the Appendix of our paper.**

```
prompt_template = """### <instruction_token>:\n<task_description>\n### <input_token>:\n<input>\n### <output_token>:\n\n"""
```

Note that the tokens and the task description need not be in the language of the input (in the case of cross-lingual revision).
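
As a minimal sketch, the template can be filled in with ordinary string formatting. The helper name `build_prompt` and its parameter names are assumptions made for illustration; the tokens and task descriptions used here are the ones shown in this card.

```python
# Same layout as prompt_template above, written with str.format placeholders.
PROMPT_TEMPLATE = (
    "### {instruction_token}:\n{task_description}\n"
    "### {input_token}:\n{text}\n"
    "### {output_token}:\n\n"
)


def build_prompt(instruction_token, task_description, input_token, text, output_token):
    """Hypothetical helper: assemble a prompt in the format described above."""
    return PROMPT_TEMPLATE.format(
        instruction_token=instruction_token,
        task_description=task_description,
        input_token=input_token,
        text=text,
        output_token=output_token,
    )


# Monolingual example: English instruction, English input.
prompt = build_prompt(
    "Instruction",
    "Fix grammatical errors in this sentence",
    "Input",
    "I has small cat ,",
    "Output",
)
print(prompt)
```

For cross-lingual revision, only the tokens and the task description change; the input text is passed through unchanged.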


### Run the model

**Make sure you have the following libraries installed:**

- `peft`
- `protobuf`
- `sentencepiece`
- `tokenizers`
- `torch`
- `transformers`

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "grammarly/medit-xl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# English GEC using Japanese instructions
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nI has small cat ,\n### 出力:\n\n'

inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20)

# The decoded text contains the prompt followed by the edited sentence.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# --> I have a small cat ,

# German GEC using Japanese instructions
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nIch haben eines kleines Katze ,\n### 出力:\n\n'

inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# --> Ich habe eine kleine Katze ,
```
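
Because `generate` returns the prompt tokens together with the continuation, the decoded string begins with the prompt. One possible way to keep only the edited text is to split on the output marker. The helper `extract_edit` below is a hypothetical sketch, and it assumes the marker `### <output_token>:` does not reappear in the generated continuation.

```python
def extract_edit(decoded: str, output_token: str) -> str:
    """Hypothetical helper: return the text after the last output marker."""
    marker = f"### {output_token}:"
    _, found, tail = decoded.rpartition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found in decoded text")
    return tail.strip()


# A decoded string shaped like the first example above (prompt + edit).
decoded = (
    "### 命令:\n文章を文法的にする\n"
    "### 入力:\nI has small cat ,\n"
    "### 出力:\n\nI have a small cat ,"
)
print(extract_edit(decoded, "出力"))  # I have a small cat ,
```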

#### Software
https://github.com/vipulraheja/medit

## Citation

**BibTeX:**
```
@article{raheja2023medit,
      title={mEdIT: Multilingual Text Editing via Instruction Tuning},
      author={Vipul Raheja and Dimitris Alikaniotis and Vivek Kulkarni and Bashar Alhafni and Dhruv Kumar},
      year={2024},
      eprint={2402.16472v1},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

**APA:**
Raheja, V., Alikaniotis, D., Kulkarni, V., Alhafni, B., & Kumar, D. (2024). mEdIT: Multilingual Text Editing via Instruction Tuning. arXiv. https://arxiv.org/abs/2402.16472