---
language:
- en
- ko
pipeline_tag: text-generation
tags:
- finetuned
---

# komt: Korean multi-task instruction tuning model

![multi task instruction tuning.jpg](https://github.com/davidkim205/komt/assets/16680469/c7f6ade7-247e-4b62-a94f-47e19abea68e)

Following the success of ChatGPT, numerous large language models have emerged in an attempt to match its capabilities.
However, many of these models still struggle to answer accurately or to generate fluent text in Korean.
This study addresses these challenges by introducing a multi-task instruction technique that uses supervised datasets from a variety of tasks to build training data for large language models (LLMs).

## Model Details

* **Model Developers**: davidkim (changyeon kim)
* **Repository**: https://github.com/davidkim205/komt
* **Model Architecture**: komt-mistral-7b-v1 is a fine-tuned version of Mistral-7B-Instruct-v0.1.

## Dataset

Korean multi-task instruction dataset

## Hardware and Software

- NVIDIA driver: 535.54.03
- CUDA version: 12.2

## Training

Refer to https://github.com/davidkim205/komt

## Usage

```
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import TextStreamer, GenerationConfig

model_name = 'davidkim205/komt-mistral-7b-v1'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Print tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer)

def gen(x):
    # Sampling settings (early_stopping only affects beam search, so it is omitted).
    generation_config = GenerationConfig(
        temperature=0.8,
        top_p=0.8,
        top_k=100,
        max_new_tokens=1024,
        do_sample=True,
    )
    # Wrap the instruction in the [INST] ... [/INST] prompt format.
    q = f"[INST]{x} [/INST]"
    # Move the inputs to the model's device instead of hard-coding 'cuda'.
    inputs = tokenizer(
        q,
        return_tensors='pt',
        return_token_type_ids=False
    ).to(model.device)
    gened = model.generate(
        **inputs,
        generation_config=generation_config,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        streamer=streamer,
    )
    result_str = tokenizer.decode(gened[0])

    # Strip everything up to a "### Response:" tag when one is present;
    # otherwise the full decoded text, prompt included, is returned.
    start_tag = "\n\n### Response: "
    start_index = result_str.find(start_tag)
    if start_index != -1:
        result_str = result_str[start_index + len(start_tag):].strip()
    return result_str

# Prompt (Korean): "I want to travel to Jeju Island alone for two days and one night; make me an itinerary."
print(gen('제주도를 1박2일로 혼자 여행하려고 하는데 여행 코스를 만들어줘'))
```

output

```
<s> [INST]제주도를 1박2일로 혼자 여행하려고 하는데 여행 코스를 만들어줘 [/INST]1박 2일 여행 코스
제주 여행 코스는 다음과 같습니다:

첫째 날:
* 아침: 제주시 신라로 이동하여 한라산 국립공원에 도착합니다. 오른쪽 발자국 광장에서 눈에 띄는 맥주숍을 찾아 한 병의 맥주를 마시고 계단으로 올라가 한라산 정상에 도착합니다.
* 점심: 정상에서 계단을 내려와 중도에 위치한 골목에서 음식을 즐깁니다. 젓갈, 곰탕, 전복 등 맛있는 제주 특산물을 맛보며 힐링을 즐깁니다.
* 오후: 중도에서 남하하여 남서쪽 제주로 이동합니다. 남서쪽 제주는 섬의 자연 경관과 백사장, 아름다운 바다 등 아름다운 풍광이 있는 곳입니다. 숙소에 숙박합니다.

둘째 날:
* 아침: 남서쪽 제주에서 북서쪽 제주로 이동합니다. 이 지역은 푸른 잔디 해변과 투명한 바다가 펼쳐져 있는 아름다운 풍광입니다. 숙소에서 아침을 먹고 바다로 향하여 해변에서 휴식을 취합니다.
* 점심: 북서쪽 제주의 앞바다에서 시원한 바다를 보며 한 끼의 해산물을 맛보고 계절에 따라 해산물 요리를 즐깁니다.
* 오후: 남서쪽 제주에서 이동하여 서쪽 제주로 이동합니다. 서쪽 제주는 절벽과 암석 해변, 수수께끼의 돌 정원 등 독특한 경관이 있는 곳입니다. 이곳에서는 서쪽 제주의 대표적인 명소인 청량사를 방문하고 서쪽 제주의 아름다운 풍광을 감상합니다.
* 저녁: 서쪽 제주에서 제주 시로 이동하여 한 끼의 제주 특산물을 맛보고 도착한 제주 도심에서 저녁을 즐깁니다.
* 야간: 제주 시의 도심에서 야간 활동을 즐기며 1박 2일의 여행을 마무리합니다.

이렇게 제주를 1박 2일로 혼자 여행하면 제주의 아름다운 풍광, 푸른 잔디 해변, 투명한 바다 등을 경험할 수 있습니다.
```
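
The decoded text shown above still contains the `<s>` marker and the `[INST] ... [/INST]` prompt. If only the answer is needed, it can be cut out with plain string handling. The following is a minimal sketch, not part of the komt repository: the helper names `build_prompt` and `extract_response` are hypothetical, and the `</s>` end-of-sequence marker is an assumption about what the tokenizer emits.

```python
def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the same [INST] ... [/INST] format used above."""
    return f"[INST]{instruction} [/INST]"


def extract_response(decoded: str, end_tag: str = "[/INST]", eos: str = "</s>") -> str:
    """Return only the text that follows the instruction block.

    `end_tag` and `eos` are assumptions about the decoded format; if the
    closing tag is missing, the whole string is returned (stripped).
    """
    # Split at the first closing tag; everything after it is the answer.
    head, found, tail = decoded.partition(end_tag)
    answer = (tail if found else head).strip()
    # Drop a trailing end-of-sequence marker if one was decoded.
    if answer.endswith(eos):
        answer = answer[: -len(eos)].strip()
    return answer


if __name__ == "__main__":
    sample = "<s> " + build_prompt("hello") + "world</s>"
    print(extract_response(sample))
```

In the usage snippet, this would be applied to the string returned by `tokenizer.decode(gened[0])`.
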

## Evaluation

For an objective evaluation, we initially used EleutherAI's lm-evaluation-harness but obtained unsatisfactory results. We therefore evaluated the models with ChatGPT, a widely used model, following the approach described in [Self-Alignment with Instruction Backtranslation](https://arxiv.org/pdf/2308.06259.pdf) and [Three Ways of Using Large Language Models to Evaluate Chat](https://arxiv.org/pdf/2308.06502.pdf).

| model                                     | score   | average (0-5) | percentage |
| ----------------------------------------- | ------- | ------------- | ---------- |
| gpt-3.5-turbo (closed)                    | 147     | 3.97          | 79.45%     |
| naver Cue (closed)                        | 140     | 3.78          | 75.67%     |
| clova X (closed)                          | 136     | 3.67          | 73.51%     |
| WizardLM-13B-V1.2 (open)                  | 96      | 2.59          | 51.89%     |
| Llama-2-7b-chat-hf (open)                 | 67      | 1.81          | 36.21%     |
| Llama-2-13b-chat-hf (open)                | 73      | 1.91          | 38.37%     |
| nlpai-lab/kullm-polyglot-12.8b-v2 (open)  | 70      | 1.89          | 37.83%     |
| kfkas/Llama-2-ko-7b-Chat (open)           | 96      | 2.59          | 51.89%     |
| beomi/KoAlpaca-Polyglot-12.8B (open)      | 100     | 2.70          | 54.05%     |
| **komt-llama2-7b-v1 (open) (ours)**       | **117** | **3.16**      | **63.24%** |
| **komt-llama2-13b-v1 (open) (ours)**      | **129** | **3.48**      | **69.72%** |
| **komt-llama-30b-v1 (open) (ours)**       | **129** | **3.16**      | **63.24%** |
| **komt-mistral-7b-v1 (open) (ours)**      | **131** | **3.54**      | **70.81%** |
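
The relationship between the three numeric columns can be illustrated with a short calculation. This is only a sketch under an assumption: the card does not state how many prompts were scored, but the gpt-3.5-turbo and komt-mistral-7b-v1 rows are consistent with 37 prompts, each rated 0 to 5, with results truncated to two decimals. `NUM_QUESTIONS` and the helper `summarize` are hypothetical names introduced for illustration.

```python
MAX_POINTS_PER_QUESTION = 5  # the table header gives a 0-5 scale
NUM_QUESTIONS = 37           # ASSUMPTION: inferred from the table, not stated in the card


def summarize(total_score: int,
              num_questions: int = NUM_QUESTIONS,
              max_points: int = MAX_POINTS_PER_QUESTION) -> tuple[float, float]:
    """Return (average per question, percentage of the maximum), truncated to 2 decimals."""
    # Integer floor division performs the truncation without float rounding surprises.
    average = (total_score * 100 // num_questions) / 100
    percentage = (total_score * 10000 // (num_questions * max_points)) / 100
    return average, percentage


if __name__ == "__main__":
    for name, score in [("gpt-3.5-turbo", 147), ("komt-mistral-7b-v1", 131)]:
        avg, pct = summarize(score)
        print(f"{name}: average={avg:.2f}, percentage={pct:.2f}%")
```

Under that assumption, `summarize(131)` gives `(3.54, 70.81)`, which matches the komt-mistral-7b-v1 row.
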