---
license: cc-by-nc-sa-4.0
datasets:
- wi_locness
- matejklemen/falko_merlin
- paws
- paws-x
- asset
language:
- en
- de
- es
- ar
- ja
- ko
- zh
metrics:
- bleu
- rouge
- sari
- accuracy
library_name: transformers
---

# Model Card for mEdIT-xl

The `medit-xl` model was obtained by fine-tuning the `MBZUAI/bactrian-x-llama-7b-lora` model on the mEdIT dataset.

**Paper:** mEdIT: Multilingual Text Editing via Instruction Tuning

**Authors:** Vipul Raheja, Dimitris Alikaniotis, Vivek Kulkarni, Bashar Alhafni, Dhruv Kumar

## Model Details

### Model Description

- **Language(s) (NLP)**: Arabic, Chinese, English, German, Japanese, Korean, Spanish
- **Finetuned from model:** `MBZUAI/bactrian-x-llama-7b-lora`

### Model Sources

- **Repository:** https://github.com/vipulraheja/medit
- **Paper:** https://arxiv.org/abs/2402.16472v1

## How to use

Given an edit instruction and an original text, our model can generate the edited version of the text.<br>

![task_specs](https://cdn-uploads.huggingface.co/production/uploads/60985a0547dc3dbf8a976607/816ZY2t0XPCpMMd6Z072K.png)

Specifically, our models support both multilingual and cross-lingual text revision. Note that the input and output texts are always in the same language; whether a revision is monolingual or cross-lingual is determined by comparing the language of the edit instruction with the language of the input text.
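For example, a Japanese instruction over English input is a cross-lingual revision, while an English instruction over the same input is monolingual. A minimal illustration (the exact prompt format is defined in the next section; both strings below are taken from the examples later in this card):

```python
# Monolingual: the instruction and the input share a language (both English).
monolingual = ("Fix grammatical errors in this sentence", "I has small cat ,")

# Cross-lingual: Japanese instruction, English input (the output stays English).
cross_lingual = ("文章を文法的にする", "I has small cat ,")
```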

### Instruction format

Adherence to the following instruction format is essential; prompts that deviate from it may produce markedly worse outputs.

```
instruction_tokens = [
    "Instruction",
    "Anweisung",
    ...
]

input_tokens = [
    "Input",
    "Aporte",
    ...
]

output_tokens = [
    "Output",
    "Produzione",
    ...
]

task_descriptions = [
    "Fix grammatical errors in this sentence",  # <-- GEC task
    "Umschreiben Sie den Satz",  # <-- Paraphrasing
    ...
]
```

**The entire list of possible instructions, input/output tokens, and task descriptions can be found in the Appendix of our paper.**

```
prompt_template = """### <instruction_token>:\n<task_description>\n### <input_token>:\n<input>\n### <output_token>:\n\n"""
```

Note that the tokens and the task description need not be in the language of the input (in the case of cross-lingual revision).
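Concretely, filling in the template might look like the following sketch. The `build_prompt` helper is illustrative and not part of the released code; the token names and the task description are taken from the lists above:

```python
# Illustrative helper: substitutes one choice of instruction/input/output
# tokens and a task description into the prompt template shown above.
def build_prompt(task_description: str, text: str,
                 instruction_token: str = "Instruction",
                 input_token: str = "Input",
                 output_token: str = "Output") -> str:
    return (
        f"### {instruction_token}:\n{task_description}\n"
        f"### {input_token}:\n{text}\n"
        f"### {output_token}:\n\n"
    )

# Monolingual English GEC prompt:
prompt = build_prompt("Fix grammatical errors in this sentence", "I has small cat ,")
```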

### Run the model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "grammarly/medit-xl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# English GEC using Japanese instructions
# (the instruction 文章を文法的にする means "make the text grammatical")
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nI has small cat ,\n### 出力:\n\n'

inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# --> I have a small cat ,

# German GEC using Japanese instructions
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nIch haben eines kleines Katze ,\n### 出力:\n\n'
# ...
# --> Ich habe eine kleine Katze ,
```
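Note that decoding `outputs[0]` in full returns the prompt followed by the model's continuation. A standard `transformers` pattern, shown here as an optional refinement of the example above (not part of the original card), is to decode only the newly generated tokens:

```python
# Slice off the prompt tokens so only the generated continuation is decoded.
prompt_length = inputs["input_ids"].shape[1]
edited = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(edited)  # expected: I have a small cat ,
```

For a 7-billion-parameter model you will likely also want to load the weights in half precision on a GPU, e.g. `AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")` (the `device_map` option requires the `accelerate` package).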

#### Software
https://github.com/vipulraheja/medit

## Citation

**BibTeX:**
```
@article{raheja2023medit,
  title={mEdIT: Multilingual Text Editing via Instruction Tuning},
  author={Vipul Raheja and Dimitris Alikaniotis and Vivek Kulkarni and Bashar Alhafni and Dhruv Kumar},
  year={2024},
  eprint={2402.16472v1},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

**APA:**
Raheja, V., Alikaniotis, D., Kulkarni, V., Alhafni, B., & Kumar, D. (2024). mEdIT: Multilingual Text Editing via Instruction Tuning. arXiv. https://arxiv.org/abs/2402.16472