traintogpb/llama-3-enko-translator-8b-qlora-bf16-upscaled

@@ -1,29 +1,36 @@
 ---
-license: cc-by-sa-4.0
 datasets:
-- traintogpb/aihub-flores-koen-integrated-sparta-mini-300k
 language:
 - en
 - ko
 pipeline_tag: translation
 ---
 ### Pretrained LM
 - [beomi/Llama-3-Open-Ko-8B](https://huggingface.co/beomi/Llama-3-Open-Ko-8B) (MIT License)
 ### Training Dataset
-- [traintogpb/aihub-flores-koen-integrated-sparta-mini-300k](https://huggingface.co/datasets/traintogpb/aihub-flores-koen-integrated-sparta-mini-300k)
 - Translates between English and Korean (bi-directional)
 ### Prompt
 - Template:
   ```python
-    prompt = f"Translate this from {src_lang} to {tgt_lang}\n### {src_lang}: {src_text}\n### {tgt_lang}: "
     >>> # src_lang can be 'English', '한국어'
     >>> # tgt_lang can be '한국어', 'English'
   ```
-  Mind that there is a "space (`_`)" at the end of the prompt (unpredictable first token will be popped up).
-  But if you use vLLM, it's okay to remove the final space(`_`).
 ### Training
 - Trained with QLoRA
@@ -33,7 +40,7 @@ pipeline_tag: translation
 - Adapters merged and upscaled to BrainFloat 16-bit precision
 ### Usage (IMPORTANT)
-- Should remove the EOS token at the end of the prompt.
   ```python
     # MODEL
     model_name = 'traintogpb/llama-3-enko-translator-8b-qlora-bf16-upscaled'
@@ -46,6 +53,7 @@ pipeline_tag: translation
     tokenizer = AutoTokenizer.from_pretrained(adapter_name)
     tokenizer.pad_token_id = 128002 # eos_token_id and pad_token_id should be different
     text = "Someday, QWER will be the greatest girl band in the world."
     input_prompt = f"Translate this from English to 한국어.\n### English: {text}\n### 한국어:"

 ---
+license: cc-by-nc-sa-4.0
 datasets:
+- traintogpb/aihub-flores-koen-integrated-sparta-base-300k
 language:
 - en
 - ko
+metrics:
+- sacrebleu
+- xcomet
 pipeline_tag: translation
+tags:
+- translation
+- text-generation
+- ko2en
+- en2ko
 ---
 ### Pretrained LM
 - [beomi/Llama-3-Open-Ko-8B](https://huggingface.co/beomi/Llama-3-Open-Ko-8B) (MIT License)
 ### Training Dataset
+- [traintogpb/aihub-flores-koen-integrated-prime-base-300k](https://huggingface.co/datasets/traintogpb/aihub-flores-koen-integrated-prime-base-300k)
 - Translates between English and Korean (bi-directional)
 ### Prompt
 - Template:
   ```python
+    prompt = f"Translate this from {src_lang} to {tgt_lang}\n### {src_lang}: {src_text}\n### {tgt_lang}:"
     >>> # src_lang can be 'English', '한국어'
     >>> # tgt_lang can be '한국어', 'English'
   ```
+  Note that the prompt ends without a trailing space (`_`); otherwise an unpredictable first token may be generated.
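The template above can be wrapped in a small helper so that the two language names and the missing trailing space are enforced in one place. This is a minimal sketch, not part of the model card: the function name `build_prompt` and the `LANG_NAMES` tuple are hypothetical, and the validation rules are an assumption based on the two language names listed in the template comments.

```python
# Hypothetical helper; `build_prompt` and `LANG_NAMES` are illustrative names,
# not identifiers from the model card.
LANG_NAMES = ("English", "한국어")


def build_prompt(src_lang: str, tgt_lang: str, src_text: str) -> str:
    """Fill the translation template for one source sentence."""
    # Assumption: only the two language names from the template comments are valid.
    if src_lang not in LANG_NAMES or tgt_lang not in LANG_NAMES:
        raise ValueError(f"languages must be one of {LANG_NAMES}")
    if src_lang == tgt_lang:
        raise ValueError("source and target languages must differ")
    # Same f-string as the template; it ends right after the colon, with no space.
    return (
        f"Translate this from {src_lang} to {tgt_lang}\n"
        f"### {src_lang}: {src_text}\n"
        f"### {tgt_lang}:"
    )


prompt = build_prompt("English", "한국어", "Hello.")
```

Calling `build_prompt("한국어", "English", text)` gives the prompt for the opposite direction.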
 ### Training
 - Trained with QLoRA
 - Adapters merged and upscaled to BrainFloat 16-bit precision
 ### Usage (IMPORTANT)
+- Remove the EOS token (`<|end_of_text|>`, id=128001) from the end of the prompt.
   ```python
     # MODEL
     model_name = 'traintogpb/llama-3-enko-translator-8b-qlora-bf16-upscaled'
     tokenizer = AutoTokenizer.from_pretrained(adapter_name)
     tokenizer.pad_token_id = 128002 # eos_token_id and pad_token_id should be different
+    # tokenizer.add_eos_token = False # There is no 'add_eos_token' option in llama3
     text = "Someday, QWER will be the greatest girl band in the world."
     input_prompt = f"Translate this from English to 한국어.\n### English: {text}\n### 한국어:"