---
library_name: transformers
license: apache-2.0
base_model: Helsinki-NLP/opus-mt-mul-en
tags:
- generated_from_trainer
- code switching
- hinglish
- code mixing
metrics:
- bleu
model-index:
- name: marianMT_hin_eng_cs
results: []
language:
- hi
- en
datasets:
- ar5entum/hindi-english-code-mixed
---
# marianMT_hin_eng_cs
This model is a fine-tuned version of [Helsinki-NLP/opus-mt-mul-en](https://huggingface.co/Helsinki-NLP/opus-mt-mul-en) on the [ar5entum/hindi-english-code-mixed](https://huggingface.co/datasets/ar5entum/hindi-english-code-mixed) dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1450
- Bleu: 77.8649
- Gen Len: 74.8945
## Model description
The model translates Hindi text written entirely in Devanagari script into a code-switched format: Hindi words are retained in Devanagari, while English loanwords that appear transliterated in Devanagari (e.g. टोटली) are restored to Roman script (totally). This handles the complexities of code-switching, producing output that reflects the intended language mixing.
Example:
| Hindi | Hindi + English CS |
|:-----------------------------------------:|:-----------------------------------------:|
|तो वो टोटली मेरे घर के प्लान पे डिपेंड करता है |to वो totally मेरे घर के plan पे depend करता है |
|मांग लो भाई बहुत नेसेसरी है |मांग लो भाई बहुत necessary है |
|टेलीविज़न में क्या प्रोग्राम चल रहा है? |television में क्या program चल रहा है? |
```python
from transformers import MarianMTModel, MarianTokenizer

class HinEngCS:
    def __init__(self, model_name='ar5entum/marianMT_hin_eng_cs'):
        self.model_name = model_name
        self.tokenizer = MarianTokenizer.from_pretrained(model_name)
        self.model = MarianMTModel.from_pretrained(model_name)

    def predict(self, input_text):
        # Tokenize the Devanagari input, generate the code-switched
        # output, and decode it back to text.
        tokenized_text = self.tokenizer(input_text, return_tensors='pt')
        translated = self.model.generate(**tokenized_text)
        return self.tokenizer.decode(translated[0], skip_special_tokens=True)

model = HinEngCS()
input_text = "आज मैं नानयांग टेक्नोलॉजिकल यूनिवर्सिटी में अनेक समझौते होते हुए देखूंगा जो कि उच्च शिक्षा साइंस टेक्नोलॉजी और इनोवेशन में हमारे सहयोग को और बढ़ाएंगे।"
print(model.predict(input_text))
# आज मैं नानयांग technological university में अनेक समझौते होते हुए देखूंगा जो कि उच्च शिक्षा science technology और innovation में हमारे सहयोग को और बढ़ाएंगे।
```
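For translating many sentences, batching is usually faster than calling `predict` once per sentence. A minimal sketch, reusing the `HinEngCS` instance from above; `predict_batch` is a hypothetical helper, not part of the released model:

```python
def predict_batch(cs, input_texts, batch_size=8):
    # Hypothetical helper: pads each batch to a common length so several
    # sentences go through a single generate() call.
    outputs = []
    for i in range(0, len(input_texts), batch_size):
        batch = input_texts[i:i + batch_size]
        tokens = cs.tokenizer(batch, return_tensors='pt', padding=True)
        generated = cs.model.generate(**tokens)
        outputs.extend(
            cs.tokenizer.decode(g, skip_special_tokens=True) for g in generated
        )
    return outputs

sentences = ["मांग लो भाई बहुत नेसेसरी है", "टेलीविज़न में क्या प्रोग्राम चल रहा है?"]
print(predict_batch(model, sentences))
```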
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (a sketch reconstructing them in code follows the list):
- learning_rate: 5e-05
- train_batch_size: 50
- eval_batch_size: 50
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- total_train_batch_size: 100
- total_eval_batch_size: 100
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 30.0
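The training script itself is not included in the card. As a rough, hedged reconstruction, the list above corresponds to `Seq2SeqTrainingArguments` along these lines (the output path is a placeholder; the Adam betas and epsilon listed above are the library defaults):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: reconstructs the hyperparameters listed above. With 2 GPUs,
# a per-device batch size of 50 yields the reported total batch size of 100.
training_args = Seq2SeqTrainingArguments(
    output_dir="marianMT_hin_eng_cs",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=50,
    per_device_eval_batch_size=50,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=30.0,
    predict_with_generate=True,  # needed so eval reports BLEU and Gen Len
)
```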
### Training results
| Training Loss | Epoch | Step | Bleu | Gen Len | Validation Loss |
|:-------------:|:-----:|:-----:|:-------:|:-------:|:---------------:|
| 1.5823 | 1.0 | 1118 | 11.6257 | 77.1622 | 1.1778 |
| 0.921 | 2.0 | 2236 | 33.2917 | 76.1459 | 0.6357 |
| 0.6472 | 3.0 | 3354 | 47.3533 | 75.9194 | 0.4504 |
| 0.5246 | 4.0 | 4472 | 55.2169 | 75.6871 | 0.3579 |
| 0.4228 | 5.0 | 5590 | 60.8262 | 75.5777 | 0.3041 |
| 0.3745 | 6.0 | 6708 | 64.8987 | 75.4424 | 0.2693 |
| 0.3552 | 7.0 | 7826 | 67.7607 | 75.2438 | 0.2455 |
| 0.3324 | 8.0 | 8944 | 69.635 | 75.1036 | 0.2274 |
| 0.2912 | 9.0 | 10062 | 71.3086 | 75.0326 | 0.2117 |
| 0.2591 | 10.0 | 11180 | 72.392 | 74.9607 | 0.2001 |
| 0.2471 | 11.0 | 12298 | 73.4758 | 74.9251 | 0.1899 |
| 0.236 | 12.0 | 13416 | 74.4219 | 74.833 | 0.1822 |
| 0.2265 | 13.0 | 14534 | 75.1435 | 74.9069 | 0.1745 |
| 0.2152 | 14.0 | 15652 | 75.7614 | 74.7409 | 0.1695 |
| 0.2078 | 15.0 | 16770 | 76.2353 | 74.7092 | 0.1641 |
| 0.2048 | 16.0 | 17888 | 76.7381 | 74.7274 | 0.1593 |
| 0.1975 | 17.0 | 19006 | 76.9954 | 74.7217 | 0.1559 |
| 0.1943 | 18.0 | 20124 | 77.421 | 74.6641 | 0.1524 |
| 0.1987 | 19.0 | 21242 | 77.8231 | 74.6833 | 0.1495 |
| 0.1855 | 20.0 | 22360 | 78.0784 | 74.6804 | 0.1472 |
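The Bleu and Gen Len columns are computed at evaluation time. A minimal sketch of a `compute_metrics` function that produces such numbers with the `evaluate` library, assuming the `tokenizer` from the snippet above is in scope (our reconstruction, not the author's script):

```python
import numpy as np
import evaluate

sacrebleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Labels use -100 for padding; swap it for the pad token before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    bleu = sacrebleu.compute(
        predictions=decoded_preds,
        references=[[label] for label in decoded_labels],
    )
    # Gen Len: mean generated length in tokens, excluding padding.
    gen_len = np.mean(
        [np.count_nonzero(p != tokenizer.pad_token_id) for p in preds]
    )
    return {"bleu": bleu["score"], "gen_len": gen_len}
```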
### Framework versions
- Transformers 4.45.0.dev0
- Pytorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1 |