This is a merge of a LoRA, trained on nateraw's English-to-Hinglish translation dataset, with llama2-7b and OpenHathi-7B-Base. Since OpenHathi has more Hindi data in its pretraining than llama2, the OpenHathi-based merge translates significantly better.
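For reference, a LoRA merge like this is typically done with the peft library. The sketch below shows the general pattern only; the adapter and base-model repository ids are placeholders, not necessarily the exact checkpoints used for this model.

```python
import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

# Load the base model (placeholder repo id for the OpenHathi base checkpoint).
base = LlamaForCausalLM.from_pretrained(
    "sarvamai/OpenHathi-7B-Hi-v0.1-Base", torch_dtype=torch.bfloat16
)

# Attach the English-to-Hinglish LoRA adapter (placeholder repo id).
model = PeftModel.from_pretrained(base, "nateraw/english-to-hinglish-lora")

# Fold the adapter weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("OpenHathi-7B-English-to-Hinglish")
```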
## Prompting
You can use the prompt template provided by nateraw:

`"Translate from english to hinglish:\n{{en}}\n---\nTranslation:\n"`
**Sample code**:

```python
from transformers import LlamaForCausalLM, AutoTokenizer
import torch

device = "cuda:0"
tokenizer = AutoTokenizer.from_pretrained('akashgoel-id/OpenHathi-7B-English-to-Hinglish')
model = LlamaForCausalLM.from_pretrained('akashgoel-id/OpenHathi-7B-English-to-Hinglish', torch_dtype=torch.bfloat16).to(device)

while True:
    # Build the prompt from the template above and tokenize it.
    # (Minimal reconstruction: the exact input-handling lines are assumed.)
    english = input("Enter English text: ")
    prompt = f"Translate from english to hinglish:\n{english}\n---\nTranslation:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    generate_ids = model.generate(inputs.input_ids, max_length=500)
    print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
```
## Limitations
The model is still not that good when it comes to idioms. For example:
1) Input: When it rains, it pours

[...]

Evaluation: This is a literal translation and doesn't quite capture the idiomatic meaning of avoiding the main point or not speaking directly about a subject. The phrase "Ghumaphira ke baat karna" would be more appropriate.
## Next steps
1) The model seems to be highly censored, since it is based on llama2. The next step would be to remove some of that censorship by finetuning on more uncensored data (similar to what WizardLM did for llama2).
2) Finetune on idioms