traintogpb committed
Commit 04ec70e
1 Parent(s): f3596b6

Update README.md

Files changed (1): README.md (+15 −7)
README.md CHANGED
@@ -1,29 +1,36 @@
---
- license: cc-by-sa-4.0
datasets:
- - traintogpb/aihub-flores-koen-integrated-sparta-mini-300k
language:
- en
- ko
pipeline_tag: translation
---
### Pretrained LM
- [beomi/Llama-3-Open-Ko-8B](https://huggingface.co/beomi/Llama-3-Open-Ko-8B) (MIT License)

### Training Dataset
- - [traintogpb/aihub-flores-koen-integrated-sparta-mini-300k](https://huggingface.co/datasets/traintogpb/aihub-flores-koen-integrated-sparta-mini-300k)
- Can translate between English and Korean (bi-directional)

### Prompt
- Template:
```python
- prompt = f"Translate this from {src_lang} to {tgt_lang}\n### {src_lang}: {src_text}\n### {tgt_lang}: "

>>> # src_lang can be 'English', '한국어'
>>> # tgt_lang can be '한국어', 'English'
```
- Mind that there is a trailing "space (`_`)" at the end of the prompt (without it, an unpredictable first token may be generated).
- But if you use vLLM, it is okay to remove the final space (`_`).

### Training
- Trained with QLoRA
@@ -33,7 +40,7 @@ pipeline_tag: translation
- Merged adapters and upscaled to BrainFloat 16-bit precision

### Usage (IMPORTANT)
- - Remove the EOS token at the end of the prompt.
```python
# MODEL
model_name = 'traintogpb/llama-3-enko-translator-8b-qlora-bf16-upscaled'
@@ -46,6 +53,7 @@ pipeline_tag: translation

tokenizer = AutoTokenizer.from_pretrained(adapter_name)
tokenizer.pad_token_id = 128002  # eos_token_id and pad_token_id should be different

text = "Someday, QWER will be the greatest girl band in the world."
input_prompt = f"Translate this from English to 한국어.\n### English: {text}\n### 한국어:"
 
---
+ license: cc-by-nc-sa-4.0
datasets:
+ - traintogpb/aihub-flores-koen-integrated-sparta-base-300k
language:
- en
- ko
+ metrics:
+ - sacrebleu
+ - xcomet
pipeline_tag: translation
+ tags:
+ - translation
+ - text-generation
+ - ko2en
+ - en2ko
---
### Pretrained LM
- [beomi/Llama-3-Open-Ko-8B](https://huggingface.co/beomi/Llama-3-Open-Ko-8B) (MIT License)

### Training Dataset
+ - [traintogpb/aihub-flores-koen-integrated-prime-base-300k](https://huggingface.co/datasets/traintogpb/aihub-flores-koen-integrated-prime-base-300k)
- Can translate between English and Korean (bi-directional)

### Prompt
- Template:
```python
+ prompt = f"Translate this from {src_lang} to {tgt_lang}\n### {src_lang}: {src_text}\n### {tgt_lang}:"

>>> # src_lang can be 'English', '한국어'
>>> # tgt_lang can be '한국어', 'English'
```
+ Mind that there is no "space (`_`)" at the end of the prompt (a trailing space can cause an unpredictable first token to be generated).
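To make the whitespace rule concrete, here is a small sketch that fills in the template above and checks that the prompt ends with the colon, not a space (the `build_prompt` helper is illustrative, not part of this repo):

```python
def build_prompt(src_lang: str, tgt_lang: str, src_text: str) -> str:
    # Template from this README; note there is no trailing space after the final colon.
    return f"Translate this from {src_lang} to {tgt_lang}\n### {src_lang}: {src_text}\n### {tgt_lang}:"

prompt = build_prompt("English", "한국어", "Hello!")
print(prompt.endswith("### 한국어:"))  # True: ends with the colon, no trailing space
```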
 
### Training
- Trained with QLoRA

- Merged adapters and upscaled to BrainFloat 16-bit precision
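The QLoRA hyperparameters are not shown in this hunk. The sketch below is a typical 4-bit NF4 setup with the standard `peft`/`bitsandbytes` stack, not the exact configuration used for this model (the rank, alpha, and target modules here are illustrative assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization (the "Q" in QLoRA); compute in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "beomi/Llama-3-Open-Ko-8B", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention projections; r and lora_alpha are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```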

### Usage (IMPORTANT)
+ - Remove the EOS token (`<|end_of_text|>`, id=128001) from the end of the prompt.
```python
# MODEL
model_name = 'traintogpb/llama-3-enko-translator-8b-qlora-bf16-upscaled'
 

tokenizer = AutoTokenizer.from_pretrained(adapter_name)
tokenizer.pad_token_id = 128002  # eos_token_id and pad_token_id should be different
+ # tokenizer.add_eos_token = False  # There is no 'add_eos_token' option in llama3

text = "Someday, QWER will be the greatest girl band in the world."
input_prompt = f"Translate this from English to 한국어.\n### English: {text}\n### 한국어:"
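The EOS rule above can be illustrated without loading the model: after encoding, drop a trailing `<|end_of_text|>` (id 128001) before generation. The helper and the example ids below are hypothetical, not taken from the actual tokenizer:

```python
EOS_TOKEN_ID = 128001  # <|end_of_text|> in the Llama-3 vocabulary

def strip_trailing_eos(input_ids, eos_id=EOS_TOKEN_ID):
    """Drop a single trailing EOS so generation starts from the prompt itself."""
    if input_ids and input_ids[-1] == eos_id:
        return input_ids[:-1]
    return input_ids

# Hypothetical encoded prompt whose last id is EOS:
ids = [128000, 48, 17, 99, 128001]
print(strip_trailing_eos(ids))  # [128000, 48, 17, 99]
```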