---
library_name: transformers
license: apache-2.0
base_model: Helsinki-NLP/opus-mt-mul-en
tags:
- generated_from_trainer
- code switching
- hinglish
- code mixing
metrics:
- bleu
model-index:
- name: marianMT_hin_eng_cs
  results: []
language:
- hi
- en
datasets:
- ar5entum/hindi-english-code-mixed
---

# marianMT_hin_eng_cs

This model is a fine-tuned version of [Helsinki-NLP/opus-mt-mul-en](https://huggingface.co/Helsinki-NLP/opus-mt-mul-en) on the [ar5entum/hindi-english-code-mixed](https://huggingface.co/datasets/ar5entum/hindi-english-code-mixed) dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1450
- Bleu: 77.8649
- Gen Len: 74.8945

## Model description

The model translates Hindi text written entirely in Devanagari script into a code-mixed format: Hindi words are retained in Devanagari, while English words (transliterated into Devanagari in the input) are converted back to Roman script. The output therefore reads as natural Hindi-English code-switching rather than an all-Devanagari transliteration.

Examples:
| Hindi                                     | Hindi + English CS                        |
|:-----------------------------------------:|:-----------------------------------------:|
|तो वो टोटली मेरे घर के प्लान पे डिपेंड करता है           |to वो totally मेरे घर के plan पे depend करता है  |
|मांग लो भाई बहुत नेसेसरी है                        |मांग लो भाई बहुत necessary है                  |
|टेलीविज़न में क्या प्रोग्राम चल रहा है?                  |television में क्या program चल रहा है?           |

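The snippet below wraps the model in a minimal helper class and translates a single sentence:
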
```python
from transformers import MarianMTModel, MarianTokenizer

class HinEngCS:
    def __init__(self, model_name='ar5entum/marianMT_hin_eng_cs'):
        self.model_name = model_name
        self.tokenizer = MarianTokenizer.from_pretrained(model_name)
        self.model = MarianMTModel.from_pretrained(model_name)

    def predict(self, input_text):
        # Tokenize the Devanagari input, generate, and decode the code-mixed output.
        tokenized_text = self.tokenizer(input_text, return_tensors='pt')
        translated = self.model.generate(**tokenized_text)
        return self.tokenizer.decode(translated[0], skip_special_tokens=True)

model = HinEngCS()

input_text = "आज मैं नानयांग टेक्नोलॉजिकल यूनिवर्सिटी में अनेक समझौते होते हुए देखूंगा जो कि उच्च शिक्षा साइंस टेक्नोलॉजी और इनोवेशन में हमारे सहयोग को और बढ़ाएंगे।"
print(model.predict(input_text))
# आज मैं नानयांग technological university में अनेक समझौते होते हुए देखूंगा जो कि उच्च शिक्षा science technology और innovation में हमारे सहयोग को और बढ़ाएंगे।
```
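
For batched inference, something like the sketch below should work; the padding setting and beam size are illustrative assumptions, not the generation settings used to produce the reported scores.

```python
# Batched inference sketch, reusing the HinEngCS instance from above.
sentences = [
    "मांग लो भाई बहुत नेसेसरी है",
    "टेलीविज़न में क्या प्रोग्राम चल रहा है?",
]
batch = model.tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.model.generate(**batch, num_beams=4)  # beam size is an arbitrary choice
print(model.tokenizer.batch_decode(outputs, skip_special_tokens=True))
```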

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a sketch of an equivalent `Seq2SeqTrainingArguments` configuration follows the list):
- learning_rate: 5e-05
- train_batch_size: 50
- eval_batch_size: 50
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- total_train_batch_size: 100
- total_eval_batch_size: 100
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 30.0
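
The block below is a minimal sketch of how these settings map onto `Seq2SeqTrainingArguments`. The `output_dir` and per-epoch evaluation strategy are assumptions (the original training script is not published), and the Adam betas/epsilon listed above are the library defaults, so they are not passed explicitly.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the hyperparameters listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="marianMT_hin_eng_cs",  # assumption
    learning_rate=5e-5,
    per_device_train_batch_size=50,    # 2 GPUs -> total train batch size 100
    per_device_eval_batch_size=50,     # 2 GPUs -> total eval batch size 100
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=30.0,
    predict_with_generate=True,        # needed to report BLEU and Gen Len at eval time
    eval_strategy="epoch",             # assumption: the results table logs per-epoch metrics
)
```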

### Training results

| Training Loss | Epoch | Step  | Bleu    | Gen Len | Validation Loss |
|:-------------:|:-----:|:-----:|:-------:|:-------:|:---------------:|
| 1.5823        | 1.0   | 1118  | 11.6257 | 77.1622 | 1.1778          |
| 0.921         | 2.0   | 2236  | 33.2917 | 76.1459 | 0.6357          |
| 0.6472        | 3.0   | 3354  | 47.3533 | 75.9194 | 0.4504          |
| 0.5246        | 4.0   | 4472  | 55.2169 | 75.6871 | 0.3579          |
| 0.4228        | 5.0   | 5590  | 60.8262 | 75.5777 | 0.3041          |
| 0.3745        | 6.0   | 6708  | 64.8987 | 75.4424 | 0.2693          |
| 0.3552        | 7.0   | 7826  | 67.7607 | 75.2438 | 0.2455          |
| 0.3324        | 8.0   | 8944  | 69.635  | 75.1036 | 0.2274          |
| 0.2912        | 9.0   | 10062 | 71.3086 | 75.0326 | 0.2117          |
| 0.2591        | 10.0  | 11180 | 72.392  | 74.9607 | 0.2001          |
| 0.2471        | 11.0  | 12298 | 73.4758 | 74.9251 | 0.1899          |
| 0.236         | 12.0  | 13416 | 74.4219 | 74.833  | 0.1822          |
| 0.2265        | 13.0  | 14534 | 75.1435 | 74.9069 | 0.1745          |
| 0.2152        | 14.0  | 15652 | 75.7614 | 74.7409 | 0.1695          |
| 0.2078        | 15.0  | 16770 | 76.2353 | 74.7092 | 0.1641          |
| 0.2048        | 16.0  | 17888 | 76.7381 | 74.7274 | 0.1593          |
| 0.1975        | 17.0  | 19006 | 76.9954 | 74.7217 | 0.1559          |
| 0.1943        | 18.0  | 20124 | 77.421  | 74.6641 | 0.1524          |
| 0.1987        | 19.0  | 21242 | 77.8231 | 74.6833 | 0.1495          |
| 0.1855        | 20.0  | 22360 | 78.0784 | 74.6804 | 0.1472          |
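
Since the training script is not published, the block below is only a sketch, in the style of the standard Hugging Face translation example, of a `compute_metrics` function that would produce the `Bleu` and `Gen Len` columns above; the use of sacreBLEU and the padding handling are assumptions.

```python
import numpy as np
import evaluate
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("ar5entum/marianMT_hin_eng_cs")
bleu = evaluate.load("sacrebleu")  # assumption: BLEU reported via sacreBLEU

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Label positions set to -100 are ignored by the loss; restore them to
    # pad tokens so they can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = bleu.compute(predictions=decoded_preds,
                          references=[[ref] for ref in decoded_labels])
    # Gen Len: mean number of non-pad tokens in the generated sequences.
    gen_len = np.mean([np.count_nonzero(p != tokenizer.pad_token_id) for p in preds])
    return {"bleu": result["score"], "gen_len": float(gen_len)}
```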


### Framework versions

- Transformers 4.45.0.dev0
- Pytorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1