This is a merge of a LoRA, trained on nateraw's English-to-Hinglish translation dataset, with llama2-7b and OpenHathi-7B-Base. Since OpenHathi has more Hindi data in its pretraining than llama2, the OpenHathi-based merge translates significantly better.
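For reference, a LoRA merge like this is typically done with the peft library. The sketch below shows the general pattern only; the adapter and base-model repository ids are placeholders, not necessarily the exact checkpoints used for this model.

```python
import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

# Load the base model (placeholder repo id for the OpenHathi base checkpoint).
base = LlamaForCausalLM.from_pretrained(
    "sarvamai/OpenHathi-7B-Hi-v0.1-Base", torch_dtype=torch.bfloat16
)

# Attach the English-to-Hinglish LoRA adapter (placeholder repo id).
model = PeftModel.from_pretrained(base, "nateraw/english-to-hinglish-lora")

# Fold the adapter weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("OpenHathi-7B-English-to-Hinglish")
```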
## Prompting
You can use the prompt template provided by nateraw:

`"Translate from english to hinglish:\n{{en}}\n---\nTranslation:\n"`
**Sample code**:

```python
from transformers import LlamaForCausalLM, AutoTokenizer
import torch

device = "cuda:0"
tokenizer = AutoTokenizer.from_pretrained('akashgoel-id/OpenHathi-7B-English-to-Hinglish')
model = LlamaForCausalLM.from_pretrained('akashgoel-id/OpenHathi-7B-English-to-Hinglish', torch_dtype=torch.bfloat16).to(device)

while True:
    # Build the prompt from the template above and tokenize it.
    # (Minimal reconstruction: the exact input-handling lines are assumed.)
    english = input("Enter English text: ")
    prompt = f"Translate from english to hinglish:\n{english}\n---\nTranslation:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    generate_ids = model.generate(inputs.input_ids, max_length=500)
    print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
```
## Limitations
The model is still not that good when it comes to idioms. For example:
1) Input: When it rains, it pours

[...]

Evaluation: This is a literal translation and doesn't quite capture the idiomatic meaning of avoiding the main point or not speaking directly about a subject. The phrase "Ghumaphira ke baat karna" would be more appropriate.
## Next steps
1) The model seems to be highly censored, since it is based on llama2. The next step would be to remove some of that censorship by finetuning on more uncensored data (similar to what WizardLM did for llama2).
2) Finetune on idioms