Ko-Med-Gemma-2-9b-it
⚠️ The output of this model should not be considered professional medical advice, diagnosis, or treatment. For accurate diagnosis and treatment of any specific medical issue, please consult a qualified physician or healthcare professional. Commercial use of this model is also prohibited.
Training
We performed continued fine-tuning with rtzr/ko-gemma-2-9b-it as the base model. We trained for 1.5 epochs on our dataset; training took about 65 hours on a single NVIDIA A100-80G GPU.
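The sketch below shows roughly what this continued fine-tuning setup could look like with TRL's SFTTrainer. Only the base model and the 1.5-epoch budget come from this card; the data file, batch size, and remaining hyperparameters are illustrative assumptions rather than our exact recipe.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical data file; the actual 449,500-example mixture is described in the Dataset section.
train_dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model="rtzr/ko-gemma-2-9b-it",      # continued fine-tuning starts from this base model
    train_dataset=train_dataset,        # assumed to be in chat "messages" format
    args=SFTConfig(
        output_dir="ko-med-gemma-2-9b-it-base",
        num_train_epochs=1.5,           # epoch budget reported in this card
        per_device_train_batch_size=1,  # single A100-80G; value is illustrative
        gradient_accumulation_steps=16, # illustrative
        bf16=True,
    ),
)
trainer.train()
```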
Dataset
The dataset listed in the tags is not all of our training data.
Our dataset consists of 449,500 examples in total, including English/Korean medical data and English/Korean general-domain data. It contains both single-turn and multi-turn samples, ranging from doctor-patient conversations to medical exam QA with reasoning.
Some of the Korean medical data is translated from English medical data, and the rest is natively Korean.
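For illustration only, the snippet below sketches the kind of chat format such a mixture typically uses; these are made-up placeholders, not samples from our training data.

```python
# Made-up examples that only illustrate the chat format of the mixture;
# they are NOT samples from our training data.
exam_qa_example = {
    "messages": [
        {"role": "user", "content": "A multiple-choice medical exam question with options A-E ..."},
        {"role": "assistant", "content": "Step-by-step reasoning about the case ... Therefore, the answer is C."},
    ]
}

doctor_patient_example = {
    "messages": [
        {"role": "user", "content": "Patient: describes a symptom ..."},
        {"role": "assistant", "content": "Doctor: asks a clarifying question ..."},
        {"role": "user", "content": "Patient: provides more detail ..."},
        {"role": "assistant", "content": "Doctor: gives advice and recommends seeing a physician if symptoms persist."},
    ]
}
```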
Evaluation
KorMedMCQA
[KorMedMCQA (Korean medical benchmark) Dataset]
We have uploaded the full results (including the exact model outputs) to Google Drive [here].
Method
We followed the lm-eval direct-generation method proposed in the original paper, with a few modifications.
See our modified lm-evaluation-harness repo [here].
- Since the nurse category contains many relatively easy problems, which tends to inflate the final average score, weight_by_size was set to false during mean aggregation.
- Changed the few-shot split from 'dev' to 'fewshot'.
- Added the 'dentist' category.
- Added multiple eos tokens to generation_kwargs, since recent models use different eos tokens.
Note
- Due to the batch-inference issue of gemma-2 models (see here), we used batch_size=1. (The batch size is also effectively 1 for closed-source models accessed through an API, except for the OpenAI Batch API.) We expect the difference from batch_size=8 to be small.
- Other settings: num_fewshot=5, seed=42, 1 NVIDIA A100-80G GPU (a sketch of the evaluation call is shown below).
- TODO (nice to have): repeat the run with two more random seeds and report the average as the final score.
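The sketch below shows roughly how such a run can be launched through the lm-evaluation-harness Python API. It is not our exact script: the task names and argument names are assumptions and may differ across harness versions, so refer to the linked repository for the actual task definitions.

```python
import lm_eval

# Not our exact script: task and argument names are assumptions and may differ
# between harness versions; the modified task definitions live in the linked repo.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ChuGyouk/ko-med-gemma-2-9b-it-merge2,dtype=bfloat16",
    tasks=["kormedmcqa_doctor", "kormedmcqa_dentist", "kormedmcqa_nurse", "kormedmcqa_pharm"],
    num_fewshot=5,  # 5-shot, matching the table below
    batch_size=1,   # see the gemma-2 batch-inference note above
)
print(results["results"])
```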
Results (5-shot, Direct Generation)
Model | Doctor | Dentist | Nurse | Pharm | Avg |
---|---|---|---|---|---|
**Closed Source** | | | | | |
gpt-4o-2024-08-06 † | 85.75 | 78.91 | 91.00 | 85.65 | 85.33 |
gpt-4o-2024-05-13 † | 85.98 | 60.67 ‡ | 84.97 | 84.18 | 78.95 |
gpt-4o-mini-2024-07-18 † | 66.20 | 61.16 | 79.50 | 69.94 | 69.20 |
HyperCLOVA X HCX-003 § | 49.89 | 50.31 | 72.55 | 62.26 | 58.75 |
HyperCLOVA X HCX-DASH-001 § | 43.22 | 42.42 | 60.02 | 47.80 | 48.36 |
solar-1-mini-chat-240612 § | 43.45 | 39.21 | 57.52 | 46.33 | 46.63 |
**Gemma-2 9B Family** | | | | | |
ChuGyouk/ko-med-gemma-2-9b-it-merge2 (Ours, Merged) | 57.47 | 56.60 | 76.42 | 68.36 | 64.71 |
ChuGyouk/ko-med-gemma-2-9b-it-merge1 (Ours, Merged) | 57.93 | 55.86 | 75.06 | 68.93 | 64.44 |
ChuGyouk/ko-med-gemma-2-9b-it-base (Ours, Base) | 57.47 | 55.24 | 76.08 | 68.81 | 64.40 |
rtzr/ko-gemma-2-9b-it | 54.02 | 53.14 | 73.46 | 64.63 | 61.32 |
google/gemma-2-9b-it | 52.41 | 52.90 | 73.58 | 64.29 | 60.80 |
**Korean Models** | | | | | |
LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct | 47.13 | 46.98 | 69.36 | 56.27 | 54.93 |
yanolja/EEVE-Korean-Instruct-10.8B-v1.0 | 44.37 | 41.18 | 66.63 | 57.29 | 52.37 |
upstage/SOLAR-10.7B-Instruct-v1.0 | 40.23 | 36.13 | 55.58 | 52.88 | 46.21 |
**Multilingual Models** | | | | | |
Qwen/Qwen2-7B-Instruct | 44.60 | 45.75 | 65.95 | 57.74 | 53.51 |
meta-llama/Meta-Llama-3.1-8B-Instruct | 39.77 | 41.68 | 59.34 | 56.61 | 49.35 |
**>10B Models** | | | | | |
google/gemma-2-27b-it (27.2B) | 58.85 | 56.47 | 79.27 | 71.86 | 66.61 |
CohereForAI/c4ai-command-r-08-2024 (32.3B) | 63.91 | 53.14 | 75.28 | 69.38 | 65.43 |
mistralai/Mistral-Nemo-Instruct-2407 (12.2B) | 42.53 | 44.51 | 66.17 | 56.38 | 52.40 |
† : For the GPT answers, we received many responses in the form of "정답: A", "정답은 B", or "정답은 C입니다" ("The answer is A / B / C"). We manually changed these answers to simply A, B, or C, and then re-measured the score.
‡ : For the dentist results of gpt-4o-2024-05-13, responses like "정답을 제공해드리겠습니다:" ("I will provide the answer:") were observed quite frequently, so the actual score is expected to be slightly higher.
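For illustration, the snippet below sketches the kind of answer normalization described in these footnotes. In practice we edited the responses by hand, so the helper below is only an approximation.

```python
import re

# Illustrative only: we normalized these responses manually. This helper simply
# pulls out the first A-E letter it finds, e.g. "정답은 B입니다" -> "B".
def extract_choice(response: str) -> str:
    match = re.search(r"[A-E]", response)
    return match.group(0) if match else response.strip()

print(extract_choice("정답: A"))        # -> "A"
print(extract_choice("정답은 B입니다"))  # -> "B"
```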
Example
Python Code
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ChuGyouk/ko-med-gemma-2-9b-it-merge2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# KorMedMCQA Doctor Test 2022, 2nd period, Question 61. Answer: B (세균수막염, bacterial meningitis)
input_prompt = "다음 의학 시험 문제에 대해 보기 A~E 중 정답을 선택하세요. 선택 이유를 구체적으로 제공하세요."
question = "6세 여아가 하루 전부터 머리가 아프고 구토를 하여 병원에 왔다. 혈압 100/60 mmHg, 맥박 110회/분, 호흡 25회/분, 체온 38.7℃이다. 졸려하고, 목경직과 커니그(Kernig) 징후는 양성이지만, 뇌신경마비나 부분신경 징후는 없다. 검사 결과는 다음과 같다. 진단은?\n혈액: 혈색소 13.8g/dL, 백혈구 14,200/mm^3, 혈소판 135,000/mm^3 총단백질 7.4 g/dL, 알부민 4.3 g/dL, 포도당 105 mg/dL, C-반응단백질 120 mg/L (참고치, <10) 프로칼시토닌 90 ng/mL (참고치, 0.00-0.49) 결핵특이 인터페론 감마(interferon-γ) 방출측정 음성, 소변: 적혈구 5～10/고배율시야, 백혈구 20～30/고배율시야, 뇌척수액: 압력 240 mmH2O, 백혈구 650/mm^3 (다형핵백혈구 90%, 림프구 10%), 단백질 112 mg/dL, 포도당 35 mg/dL, 올리고클론띠 음성, 아데노신탈아미노효소활성도(ADA) 4.1 U/L (참고치, <10)"
A = "결핵수막염"
B = "세균수막염"
C = "자가면역뇌염"
D = "바이러스수막염"
E = "급성파종뇌척수염"
instruction = f"{input_prompt}\n\n{question}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\nE. {E}"

messages = [
    {
        "role": "user",
        "content": instruction
    }
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<end_of_turn>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.7,
    top_p=1.0,
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
Output
정답은 B입니다. 세균수막염.

다음은 각 선택지에 대한 설명입니다:

A. 결핵수막염: 결핵특이 인터페론-감마 방출측정이 음성이므로 결핵수막염의 가능성이 낮습니다.

B. 세균수막염: 환자는 고열, 구토, 두통, 목경직, 커니그 징후 양성 등 세균수막염의 전형적인 증상을 보입니다. 혈액 검사에서 백혈구 증가가 나타나며, 뇌척수액 검사에서 백혈구 수가 높고(650/mm3), 다형핵백혈구가 차지하는 비율이 높으며(90%), 포도당 수치가 낮고(35 mg/dL), 단백질 수치가 높습니다(112 mg/dL). 이러한 소견은 세균성 염증을 나타냅니다.

C. 자가면역뇌염: 자가면역뇌염은 일반적으로 뇌신경마비나 부분신경 징후를 동반하지만, 이 환자는 이러한 증상을 보이지 않습니다. 또한, 뇌척수액 검사에서 ADA 활성도가 정상 범위에 있으므로 자가면역뇌염의 가능성이 낮습니다.

D. 바이러스수막염: 바이러스수막염은 일반적으로 백혈구 수치가 낮고(200/mm3 이하), 단백질 수치가 낮으며, 포도당 수치는 정상 또는 약간 낮을 수 있습니다. 이 환자는 세균성 염증의 증거가 있으며, 바이러스 침습을 나타내는 특이적인 소견이 없습니다.

E. 급성파종뇌척수염: 급성파종뇌척수염은 일반적으로 다양한 신경학적 결손을 동반하며, 뇌척수액에서 올리고클론띠가 나타납니다. 이 환자는 뇌신경마비나 부분신경 징후가 없으며, 올리고클론띠도 나타나지 않습니다.

결론적으로, 환자의 임상 증상과 뇌척수액 검사 결과는 세균성 수막염을 가장 가능성이 높은 진단으로 지지합니다.
Other Example
# "ํ์ด๋ ๋์ ๋จน์๋๋ฐ๋ ๋ถ๊ตฌํ๊ณ ๋ํต์ด ๋๋ฌด ์ฌํด์. ์ด๋ป๊ฒ ํด์ผ ํ๋์?"
๋ํต์ด ์ฌํ์๋ค๋ ์ํ๊น์ต๋๋ค. ํ์ด๋ ๋์ ๋ณต์ฉํ์์๋ ๋ถ๊ตฌํ๊ณ ๋ํต์ด ์ง์๋๋ค๋ฉด, ๋ค์๊ณผ ๊ฐ์ ๋ฐฉ๋ฒ์ ๊ณ ๋ คํด ๋ณผ ์ ์์ต๋๋ค:
1. **์ถฉ๋ถํ ํด์๊ณผ ์๋ถ ์ญ์ทจ**: ๋ํต์ด ์ฌํ ๋๋ ์กฐ์ฉํ ๊ณณ์์ ํด์์ ์ทจํ๊ณ ์ถฉ๋ถํ ์๋ถ์ ์ญ์ทจํ๋ ๊ฒ์ด ๋์์ด ๋ ์ ์์ต๋๋ค.
2. **๋์ฐ์ง**: ์ผ์ํฉ์ด๋ ์ฐจ๊ฐ์ด ๋ฌผ๋ก ์ ์ ์๊ฑด์ ์ด๋ง๋ ๋ชฉ์ ๋์ด ๋์ฐ์ง์ ํด๋ณด์ธ์.
3. **์นดํ์ธ ์ญ์ทจ**: ์ ๋น๋์ ์นดํ์ธ์ ๋ํต์ ์ํ์ํฌ ์ ์์ต๋๋ค. ์ปคํผ๋ ์ฐจ๋ฅผ ์กฐ๊ธ ๋ง์
๋ณด์ธ์.
4. **์คํธ๋ ์นญ**: ๋ชฉ๊ณผ ์ด๊นจ๋ฅผ ๋ถ๋๋ฝ๊ฒ ์คํธ๋ ์นญํ์ฌ ๊ธด์ฅ์ ํ์ด๋ณด์ธ์.
5. **๋ค๋ฅธ ์งํต์ **: ํ์ด๋ ๋ ์ธ์ ์ด๋ถํ๋กํ์ด๋ ๋ํ๋ก์ผ๊ณผ ๊ฐ์ ๋ค๋ฅธ ์งํต์ ๋ฅผ ๋ณต์ฉํด ๋ณผ ์ ์์ต๋๋ค. ํ์ง๋ง ๋ณต์ฉ ์ ์ ๋ฐ๋์ ์ฝ์ฌ๋ ์์ฌ์ ์๋ดํ์ธ์.
6. **์์ฌ ์๋ด**: ๋ํต์ด ์ง์๋๊ฑฐ๋ ์ฌํด์ง๋ค๋ฉด, ์์ฌ์ ์๋ดํ์ฌ ์์ธ์ ํ์
ํ๊ณ ์ ์ ํ ์น๋ฃ๋ฅผ ๋ฐ๋ ๊ฒ์ด ์ค์ํฉ๋๋ค.
๋ํต์ ์์ธ์ ๋ค์ํ ์ ์์ผ๋ฏ๋ก, ์ ํํ ์ง๋จ๊ณผ ์น๋ฃ๋ฅผ ์ํด ์๋ฃ ์ ๋ฌธ๊ฐ์ ๋์์ ๋ฐ๋ ๊ฒ์ด ๊ฐ์ฅ ์ข์ต๋๋ค.
MergeKit Details
This is a merge of pre-trained language models created using mergekit.
Merge Details
Merge Method
This model was merged using the DARE TIES merge method, with google/gemma-2-9b-it as the base.
Models Merged
The following models were included in the merge:
- rtzr/ko-gemma-2-9b-it
- ChuGyouk/ko-med-gemma-2-9b-it-base
Configuration
The following YAML configuration was used to produce this model:
```yaml
models:
  - model: google/gemma-2-9b-it
    # No parameters necessary for base model
  - model: rtzr/ko-gemma-2-9b-it
    parameters:
      density: 0.53
      weight: 0.4
  - model: ChuGyouk/ko-med-gemma-2-9b-it-base
    parameters:
      density: 0.53
      weight: 0.6
merge_method: dare_ties
base_model: google/gemma-2-9b-it
parameters:
  int8_mask: true
dtype: bfloat16
```
Configuration for ChuGyouk/ko-med-gemma-2-9b-it-merge1 Model
```yaml
models:
  - model: rtzr/ko-gemma-2-9b-it
  - model: ChuGyouk/ko-med-gemma-2-9b-it-base
    parameters:
      density: 0.5
      weight: 0.5
merge_method: dare_ties
base_model: rtzr/ko-gemma-2-9b-it
parameters:
  int8_mask: true
dtype: bfloat16
```
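Either configuration can be reproduced with the mergekit CLI; the minimal sketch below uses placeholder paths for the config file and the output directory.

```python
import subprocess

# "merge-config.yml" and "./merged-model" are placeholder paths.
subprocess.run(
    ["mergekit-yaml", "merge-config.yml", "./merged-model", "--cuda"],
    check=True,
)
```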
Contact
kyouwook@kaist.ac.kr
magicwho@unist.ac.kr