instructTrans-v2
Introduction
exaone3-instrucTrans-v2-enko-7.8b is an English-to-Korean translation model fine-tuned from exaone-3-7.8B-it. It is intended for translating English instruction datasets into Korean, and was trained on the following translation datasets:
- nayohan/aihub-en-ko-translation-12m
- nayohan/instruction_en_ko_translation_1.4m
- Translation-EnKo/trc_uniform_313k_eval_45_filtered
Generating Text
This model supports translation from English to Korean. To translate text, use the following Python code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Translation-EnKo/exaone3-instrucTrans-v2-enko-7.8b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# System prompt (Korean): "You are a translator. Translate English into Korean."
system_prompt = "당신은 번역기 입니다. 영어를 한국어로 번역하세요."
sentence = "The aerospace industry is a flower in the field of technology and science."
conversation = [{'role': 'system', 'content': system_prompt},
                {'role': 'user', 'content': sentence}]

inputs = tokenizer.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors='pt'
).to(model.device)  # follow the device chosen by device_map="auto"
outputs = model.generate(inputs, max_new_tokens=4096) # Finetuned with length 8192
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))
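When translating many sentences, the message layout from the example above can be factored into a small helper so that every request carries the same system prompt. The following is a minimal sketch: `build_conversation` is a hypothetical name introduced here for illustration and is not part of the model repository. It only builds the message lists; passing them to `tokenizer.apply_chat_template` works as in the example above.

```python
from typing import Dict, List

# Korean system prompt: "You are a translator. Translate English into Korean."
SYSTEM_PROMPT = "당신은 번역기 입니다. 영어를 한국어로 번역하세요."


def build_conversation(sentence: str, system_prompt: str = SYSTEM_PROMPT) -> List[Dict[str, str]]:
    """Hypothetical helper: return the system + user message pair for one sentence."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": sentence},
    ]


# One conversation per source sentence.
sentences = ["Hello.", "How are you?"]
conversations = [build_conversation(s) for s in sentences]
```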
Inference with vLLM
# Requires a GPU with at least 24 GB of VRAM. With 12 GB of VRAM, you will need to run in FP8 mode.
python vllm_inference.py -gpu_id 0 -split_idx 0 -split_num 2 -dname "nvidia/HelpSteer" -untrans_col 'helpfulness' 'correctness' 'coherence' 'complexity' 'verbosity' > 0.out
python vllm_inference.py -gpu_id 1 -split_idx 1 -split_num 2 -dname "nvidia/HelpSteer" -untrans_col 'helpfulness' 'correctness' 'coherence' 'complexity' 'verbosity' > 1.out
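The `-split_idx` and `-split_num` flags pick a percentage slice of the `train` split, so that each GPU translates a different part of the dataset. The sketch below reproduces the slice arithmetic used by the script (`step = 100 // split_num`); the function name `split_slice` is hypothetical and exists only for this illustration.

```python
def split_slice(split_idx: int, split_num: int) -> str:
    """Return the datasets split string for one shard (hypothetical helper)."""
    step = 100 // split_num  # integer percentage covered by each shard
    return f"train[{step * split_idx}%:{step * (split_idx + 1)}%]"


# The two commands above correspond to these slices.
for idx in range(2):
    print(idx, split_slice(idx, 2))
```

Because the step uses integer division, a `split_num` that does not divide 100 leaves a remainder uncovered: with `-split_num 3` the last shard ends at 99%, so the final 1% of the data is not translated.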
import os
import argparse
import pandas as pd
from tqdm import tqdm
from typing import List, Dict
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
# Truncate inputs longer than 4096 tokens instead of dropping them, so the dataset size stays the same.
def truncation_func(sample, column_name):
    input_ids = tokenizer(str(sample[column_name]), truncation=True, max_length=4096, add_special_tokens=False).input_ids
    output = tokenizer.decode(input_ids)
    sample[column_name] = output
    return sample
# Convert a column to the chat template format.
def create_conversation(sample, column_name):
    # System prompt (Korean): "You are a translator. Translate English sentences into Korean."
    SYSTEM_PROMPT = "당신은 번역기 입니다. 영어 문장을 한국어로 번역하세요."
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": sample[column_name]}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    sample[column_name] = text
    return sample
def load_dataset_preprocess(dataset_name:str, untranslate_column:List, split_num, split_idx, subset=None, num_proc=128) -> Dataset:
    step = 100//split_num # split datasets
    if subset:
        dataset = load_dataset(dataset_name, subset, split=f'train[{step*split_idx}%:{step*(split_idx+1)}%]')
    else:
        dataset = load_dataset(dataset_name, split=f'train[{step*split_idx}%:{step*(split_idx+1)}%]')
    print(dataset)
    original_dataset = dataset # To leave columns untranslated
    dataset = dataset.remove_columns(untranslate_column)

    # Truncate, then wrap every remaining column in the chat template.
    for feature in dataset.features:
        dataset = dataset.map(lambda x: truncation_func(x,feature), num_proc=num_proc)
        dataset = dataset.map(lambda x: create_conversation(x,feature), batched=False, num_proc=num_proc)

    print("filtered_dataset:", dataset)
    return dataset, original_dataset
def save_dataset(result_dict:Dict, dataset_name, untranslate_column:List, split_idx, subset:str):
    # Copy the untranslated columns back from the module-level original_dataset.
    for column in untranslate_column:
        result_dict[column] = original_dataset[column]

    df = pd.DataFrame(result_dict)
    output_file_name = dataset_name.split('/')[-1]
    os.makedirs('gen', exist_ok=True)
    if subset:
        save_path = f"gen/{output_file_name}_{subset}_{split_idx}.jsonl"
    else:
        save_path = f"gen/{output_file_name}_{split_idx}.jsonl"
    df.to_json(save_path, lines=True, orient='records', force_ascii=False)
if __name__=="__main__":
    model_name = "Translation-EnKo/exaone3-instrucTrans-v2-enko-7.8b"
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    parser = argparse.ArgumentParser(description='load dataset name & split size')
    parser.add_argument('-dname', type=str, default="Magpie-Align/Magpie-Pro-MT-300K-v0.1")
    parser.add_argument('-untrans_col', nargs='+', default=[])
    parser.add_argument('-split_num', type=int, default=4)
    parser.add_argument('-split_idx', type=int, default=0)
    parser.add_argument('-gpu_id', type=int, default=0)
    parser.add_argument('-subset', type=str, default=None)
    parser.add_argument('-num_proc', type=int, default=128)

    args = parser.parse_args()
    os.environ["CUDA_VISIBLE_DEVICES"]=str(args.gpu_id)
    dataset, original_dataset = load_dataset_preprocess(args.dname,
                                                        args.untrans_col,
                                                        args.split_num,
                                                        args.split_idx,
                                                        args.subset,
                                                        args.num_proc
                                                        )
    # define model
    sampling_params = SamplingParams(
        temperature=0,
        max_tokens=8192,
    )
    llm = LLM(
        model=model_name,
        tensor_parallel_size=1,
        gpu_memory_utilization=0.95,
    )
    # inference model: translate one column at a time and save after each column
    result_dict = {}
    for feature in tqdm(dataset.features):
        print(f"'{feature}' column in progress..")
        outputs = llm.generate(dataset[feature], sampling_params)
        result_dict[feature]=[output.outputs[0].text for output in outputs]
        save_dataset(result_dict, args.dname, args.untrans_col, args.split_idx, args.subset)
        print(f"saved to json. column: {feature}")
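Each run of the script writes one shard to `gen/{output_file_name}_{split_idx}.jsonl`, so the shards have to be concatenated afterwards to recover the full translated dataset. The sketch below is one way to do that with the standard library only. `merge_splits` is a hypothetical helper, not part of the script, and it assumes the script was run without `-subset` (with a subset, the file name also contains the subset name).

```python
import json
from pathlib import Path
from typing import Dict, List


def merge_splits(gen_dir: str, stem: str, split_num: int) -> List[Dict]:
    """Hypothetical helper: read {stem}_0.jsonl .. {stem}_{split_num-1}.jsonl in shard order."""
    rows: List[Dict] = []
    for idx in range(split_num):
        path = Path(gen_dir) / f"{stem}_{idx}.jsonl"
        with path.open(encoding="utf-8") as f:
            # One JSON object per line; skip blank lines.
            rows.extend(json.loads(line) for line in f if line.strip())
    return rows
```

For the two HelpSteer commands shown earlier, the call would be `merge_splits("gen", "HelpSteer", 2)`. Reading the shards in index order keeps the rows in the same order as the original `train` split.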
Result
# EVAL_RESULT (2405_KO_NEWS) (max_new_tokens=512)
"en_ref":"This controversy arose around a new advertisement for the latest iPad Pro that Apple released on YouTube on the 7th. The ad shows musical instruments, statues, cameras, and paints being crushed in a press, followed by the appearance of the iPad Pro in their place. It appears to emphasize the new iPad Pro's artificial intelligence features, advanced display, performance, and thickness. Apple mentioned that the newly unveiled iPad Pro is equipped with the latest 'M4' chip and is the thinnest device in Apple's history. The ad faced immediate backlash upon release, as it graphically depicts objects symbolizing creators being crushed. Critics argue that the imagery could be interpreted as technology trampling on human creators. Some have also voiced concerns that it evokes a situation where creators are losing ground due to AI."
"ko_ref":"์ด๋ฒ ๋
ผ๋์ ์ ํ์ด ์ง๋ 7์ผ ์ ํ๋ธ์ ๊ณต๊ฐํ ์ ํ ์์ดํจ๋ ํ๋ก ๊ด๊ณ ๋ฅผ ๋๋ฌ์ธ๊ณ ๋ถ๊ฑฐ์ก๋ค. ํด๋น ๊ด๊ณ ์์์ ์
๊ธฐ์ ์กฐ๊ฐ์, ์นด๋ฉ๋ผ, ๋ฌผ๊ฐ ๋ฑ์ ์์ฐฉ๊ธฐ๋ก ์ง๋๋ฅธ ๋ค ๊ทธ ์๋ฆฌ์ ์์ดํจ๋ ํ๋ก๋ฅผ ๋ฑ์ฅ์ํค๋ ๋ด์ฉ์ด์๋ค. ์ ํ ์์ดํจ๋ ํ๋ก์ ์ธ๊ณต์ง๋ฅ ๊ธฐ๋ฅ๋ค๊ณผ ์งํ๋ ๋์คํ๋ ์ด์ ์ฑ๋ฅ, ๋๊ป ๋ฑ์ ๊ฐ์กฐํ๊ธฐ ์ํ ์ทจ์ง๋ก ํ์ด๋๋ค. ์ ํ์ ์ด๋ฒ์ ๊ณต๊ฐํ ์์ดํจ๋ ํ๋ก์ ์ ํ โM4โ ์นฉ์ด ํ์ฌ๋๋ฉฐ ๋๊ป๋ ์ ํ์ ์ญ๋ ์ ํ ์ค ๊ฐ์ฅ ์๋ค๋ ์ค๋ช
๋ ๋ง๋ถ์๋ค. ๊ด๊ณ ๋ ๊ณต๊ฐ ์งํ ๊ฑฐ์ผ ๋นํ์ ์ง๋ฉดํ๋ค. ์ฐฝ์์๋ฅผ ์์งํ๋ ๋ฌผ๊ฑด์ด ์ง๋๋ ค์ง๋ ๊ณผ์ ์ ์ง๋์น๊ฒ ์ ๋๋ผํ๊ฒ ๋ฌ์ฌํ ์ ์ด ๋ฌธ์ ๊ฐ ๋๋ค. ๊ธฐ์ ์ด ์ธ๊ฐ ์ฐฝ์์๋ฅผ ์ง๋ฐ๋ ๋ชจ์ต์ ๋ฌ์ฌํ ๊ฒ์ผ๋ก ํด์๋ ์ฌ์ง๊ฐ ์๋ค๋ ๋ฌธ์ ์์์ด๋ค. ์ธ๊ณต์ง๋ฅ(AI)์ผ๋ก ์ธํด ์ฐฝ์์๊ฐ ์ค ์๋ฆฌ๊ฐ ์ค์ด๋๋ ์ํฉ์ ์ฐ์์ํจ๋ค๋ ๋ชฉ์๋ฆฌ๋ ๋์๋ค."
"exaone3-InstrucTrans-v2":"์ด๋ฒ ๋
ผ๋์ ์ ํ์ด ์ง๋ 7์ผ ์ ํ๋ธ์ ๊ณต๊ฐํ ์ต์ ํ ์์ดํจ๋ ํ๋ก์ ์ ๊ด๊ณ ๋ฅผ ๋๋ฌ์ธ๊ณ ๋ถ๊ฑฐ์ก๋ค. ์ด ๊ด๊ณ ๋ ์
๊ธฐ, ์กฐ๊ฐ์, ์นด๋ฉ๋ผ, ๋ฌผ๊ฐ ๋ฑ์ด ํ๋ ์ค๊ธฐ์ ์ง๋๋ฆฌ๋ ์ฅ๋ฉด์ ์ด์ด ๊ทธ ์๋ฆฌ์ ์์ดํจ๋ ํ๋ก๊ฐ ๋ฑ์ฅํ๋ ์ฅ๋ฉด์ ๋ณด์ฌ์ค๋ค. ์๋ก์ด ์์ดํจ๋ ํ๋ก์ ์ธ๊ณต์ง๋ฅ ๊ธฐ๋ฅ, ์ฒจ๋จ ๋์คํ๋ ์ด, ์ฑ๋ฅ, ๋๊ป๋ฅผ ๊ฐ์กฐํ๋ ๊ฒ์ผ๋ก ๋ณด์ธ๋ค. ์ ํ์ ์ด๋ฒ์ ๊ณต๊ฐ๋ ์์ดํจ๋ ํ๋ก์ ์ต์ 'M4' ์นฉ์ด ํ์ฌ๋์ผ๋ฉฐ, ์ ํ ์ญ์ฌ์ ๊ฐ์ฅ ์์ ๋๊ป๋ฅผ ์๋ํ๋ค๊ณ ์ธ๊ธํ๋ค. ์ด ๊ด๊ณ ๋ ๊ณต๊ฐ๋์๋ง์ ํฌ๋ฆฌ์์ดํฐ๋ฅผ ์์งํ๋ ์ฌ๋ฌผ๋ค์ด ์ง๋ฐํ๋ ์ฅ๋ฉด์ ๊ทธ๋ํฝ์ผ๋ก ํํํด ์ฆ๊ฐ์ ์ธ ๋ฐ๋ฐ์ ๋ถ๋ชํ๋ค. ๋นํ๊ฐ๋ค์ ์ด ์ด๋ฏธ์ง๊ฐ ๊ธฐ์ ์ด ์ธ๊ฐ ํฌ๋ฆฌ์์ดํฐ๋ฅผ ์ง๋ฐ๋ ๊ฒ์ผ๋ก ํด์๋ ์ ์๋ค๊ณ ์ฃผ์ฅํ๋ค. ์ผ๋ถ์์๋ AI๋ก ์ธํด ํฌ๋ฆฌ์์ดํฐ๋ค์ด ์ค ์๋ฆฌ๋ฅผ ์๋ ์ํฉ์ ์ฐ์์ํจ๋ค๋ ์ฐ๋ ค์ ๋ชฉ์๋ฆฌ๋ ๋์๋ค."
"llama3-InstrucTrans":"์ด๋ฒ ๋
ผ๋์ ์ ํ์ด ์ง๋ 7์ผ ์ ํ๋ธ์ ๊ณต๊ฐํ ์ต์ ์์ดํจ๋ ํ๋ก ๊ด๊ณ ๋ฅผ ์ค์ฌ์ผ๋ก ๋ถ๊ฑฐ์ก๋ค. ์ด ๊ด๊ณ ๋ ์
๊ธฐ, ์กฐ๊ฐ์, ์นด๋ฉ๋ผ, ๋ฌผ๊ฐ ๋ฑ์ ๋๋ฅด๊ธฐ ์์ํ๋ ์ฅ๋ฉด๊ณผ ํจ๊ป ๊ทธ ์๋ฆฌ์ ์์ดํจ๋ ํ๋ก๊ฐ ๋ฑ์ฅํ๋ ์ฅ๋ฉด์ ๋ณด์ฌ์ค๋ค. ์ด๋ ์๋ก์ด ์์ดํจ๋ ํ๋ก์ ์ธ๊ณต์ง๋ฅ ๊ธฐ๋ฅ, ๊ณ ๊ธ ๋์คํ๋ ์ด, ์ฑ๋ฅ, ๋๊ป๋ฅผ ๊ฐ์กฐํ๋ ๊ฒ์ผ๋ก ๋ณด์ธ๋ค. ์ ํ์ ์ด๋ฒ์ ๊ณต๊ฐํ ์์ดํจ๋ ํ๋ก์ ์ต์ 'M4' ์นฉ์ด ํ์ฌ๋์ผ๋ฉฐ, ์ ํ ์ญ์ฌ์ ๊ฐ์ฅ ์์ ๊ธฐ๊ธฐ๋ผ๊ณ ์ธ๊ธํ๋ค. ์ด ๊ด๊ณ ๋ ์ถ์ํ์๋ง์ ํฌ๋ฆฌ์์ดํฐ๋ฅผ ์์งํ๋ ๋ฌผ๊ฑด์ด ํ์๋๋ ์ฅ๋ฉด์ด ๊ทธ๋๋ก ๊ทธ๋ ค์ ธ ๋
ผ๋์ด ๋๊ณ ์๋ค. ๋นํ๊ฐ๋ค์ ์ด ์ด๋ฏธ์ง๊ฐ ๊ธฐ์ ์ด ์ธ๊ฐ ํฌ๋ฆฌ์์ดํฐ๋ฅผ ์ง๋ฐ๋๋ค๋ ์๋ฏธ๋ก ํด์๋ ์ ์๋ค๊ณ ์ฃผ์ฅํ๋ค. ๋ํ AI๋ก ์ธํด ํฌ๋ฆฌ์์ดํฐ๋ค์ด ๋ฐ๋ฆฌ๊ณ ์๋ค๋ ์ํฉ์ ์ฐ์์ํจ๋ค๋ ์ฐ๋ ค์ ๋ชฉ์๋ฆฌ๋ ๋์จ๋ค."
Evaluation Result
We selected datasets for evaluating English-to-Korean translation performance and ran the evaluation on them.
Evaluation dataset sources
- Aihub/FLoRes: traintogpb/aihub-flores-koen-integrated-sparta-30k | (test set 1k)
- iwslt-2023 : shreevigneshs/iwslt-2023-en-ko-train-val-split-0.1 | (f_test 597, if_test 597)
- ko_news_2024: nayohan/ko_news_eval40 | (40)
Evaluation method
- Unlike the previous evaluation (hf), this evaluation ran inference with vLLM. (Common setting: max_new_tokens=512)
- The detailed evaluation procedure follows the earlier instruct-Trans results. [link]
Average
- With vLLM, scores were lower overall than with HF.
Performance comparison by model
Model name | AIHub | Flores | IWSLT | News | Average |
---|---|---|---|---|---|
Meta-Llama | | | | | |
meta-llama/Meta-Llama-3-8B-Instruct | 0.3075 | 0.295 | 2.395 | 0.17 | 0.7919 |
nayohan/llama3-8b-it-translation-general-en-ko-1sent | 15.7875 | 8.09 | 4.445 | 4.68 | 8.2506 |
nayohan/llama3-instrucTrans-enko-8b | 16.3938 | 9.63 | 5.405 | 5.3225 | 9.1878 |
nayohan/llama3-8b-it-general-trc313k-enko-8k | 14.7225 | 10.47 | 4.45 | 7.555 | 9.2994 |
Gemma | | | | | |
Translation-EnKo/gemma-2-2b-it-general1.2m-trc313eval45 | 13.7775 | 7.88 | 3.95 | 6.105 | 7.9281 |
Translation-EnKo/gemma-2-9b-it-general1.2m-trc313eval45 | 18.9887 | 13.215 | 6.28 | 9.975 | 12.1147 |
Translation-EnKo/gukbap-gemma-2-9b-it-general1.2m-trc313eval45 | 18.405 | 12.44 | 6.59 | 9.64 | 11.7688 |
EXAONE | | | | | |
CarrotAI/EXAONE-3.0-7.8B-Instruct-Llamafied-8k | 4.9375 | 4.9 | 1.58 | 8.215 | 4.9081 |
Translation-EnKo/exaeon3-translation-general-enko-7.8b (private) | 17.8275 | 8.56 | 2.72 | 6.31 | 8.8544 |
Translation-EnKo/exaone3-instrucTrans-v2-enko-7.8b | 19.6075 | 13.46 | 7.28 | 11.4425 | 12.9475 |
Performance analysis by training dataset
Model name | AIHub | Flores | IWSLT | News | Average |
---|---|---|---|---|---|
Meta-Llama | | | | | |
Meta-Llama-3-8B-Instruct | 0.3075 | 0.295 | 2.395 | 0.17 | 0.7919 |
llama3-8b-it-general1.2m-en-ko-4k | 15.7875 | 8.09 | 4.445 | 4.68 | 8.2506 |
llama3-8b-it-general1.2m-trc313k-enko-4k | 16.3938 | 9.63 | 5.405 | 5.3225 | 9.1878 |
llama3-8b-it-general1.2m-trc313k-enko-8k | 14.7225 | 10.47 | 4.45 | 7.555 | 9.2994 |
Gemma | | | | | |
gemma-2-2b-it-general1.2m-trc313eval45 | 13.7775 | 7.88 | 3.95 | 6.105 | 7.9281 |
gemma-2-9b-it-general1.2m-trc313eval45 | 18.9887 | 13.215 | 6.28 | 9.975 | 12.1147 |
gukbap-gemma-2-9b-it-general1.2m-trc313eval45 | 18.405 | 12.44 | 6.59 | 9.64 | 11.7688 |
EXAONE | | | | | |
EXAONE-3.0-7.8B-Instruct | 4.9375 | 4.9 | 1.58 | 8.215 | 4.9081 |
EXAONE-3.0-7.8B-Instruct-general12m (private) | 17.8275 | 8.56 | 2.72 | 6.31 | 8.8544 |
EXAONE-3.0-7.8B-Instruct-general12m-trc1400k-trc313eval45 | 19.6075 | 13.46 | 7.28 | 11.4425 | 12.9475 |
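The last column of the tables above appears to be the unweighted mean of the four benchmark columns (AIHub, Flores, IWSLT, News). The model card does not state this explicitly, so treat it as an inference from the numbers. The sketch below recomputes the mean for three rows copied from the table and compares each against the reported value; the variable and function names are introduced here for illustration only.

```python
# (AIHub, Flores, IWSLT, News) scores and the reported average, copied from the table.
rows = {
    "meta-llama/Meta-Llama-3-8B-Instruct": ((0.3075, 0.295, 2.395, 0.17), 0.7919),
    "Translation-EnKo/gemma-2-9b-it-general1.2m-trc313eval45": ((18.9887, 13.215, 6.28, 9.975), 12.1147),
    "Translation-EnKo/exaone3-instrucTrans-v2-enko-7.8b": ((19.6075, 13.46, 7.28, 11.4425), 12.9475),
}


def unweighted_mean(scores):
    """Plain arithmetic mean over the benchmark columns."""
    return sum(scores) / len(scores)


for name, (scores, reported) in rows.items():
    computed = unweighted_mean(scores)
    # The table rounds to four decimals, so allow a small tolerance.
    matches = abs(computed - reported) < 1e-4
    print(f"{name}: computed={computed:.4f} reported={reported} match={matches}")
```

Since the mean is unweighted, the 40-sentence News set counts as much as the 1k-sentence AIHub/FLoRes test sets, which is worth keeping in mind when comparing models by the average alone.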
Citation
@misc{InstrcTrans-v2,
title={exaone3-instrucTrans-v2-enko-7.8b},
author={Yohan Na, Suzie Oh, Eunji Kim, Mingyou sung},
year={2024},
url={https://huggingface.co/Translation-EnKo/exaone3-instrucTrans-v2-enko-7.8b}
}
@misc{llama3modelcard,
title={Llama 3 Model Card},
author={AI@Meta},
year={2024},
url={https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}
@article{exaone-3.0-7.8B-instruct,
title={EXAONE 3.0 7.8B Instruction Tuned Language Model},
author={LG AI Research},
journal={arXiv preprint arXiv:2408.03541},
year={2024}
}
@article{gemma_2024,
title={Gemma},
url={https://www.kaggle.com/m/3301},
DOI={10.34740/KAGGLE/M/3301},
publisher={Kaggle},
author={Gemma Team},
year={2024}
}