yokhal-md / README.md
seonglae's picture
Update README.md
310ffaf verified
metadata
language:
  - ko
  - en
library_name: transformers
tags:
  - trl
  - sft
widget:
  - text: 안녕

Yokhal (욕쟁이 할머니)

Korean Chatbot based on Google Gemma

Model Details

Model Description

  • Fine-tuned by: Alan Jo
  • Model type: Gemma
  • Language(s) (NLP): Korean, English
  • Finetuned from model: Gemma-2b-it

Model Sources

Uses

Direct Use

Korean Chatbot with Internet culture

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.


tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                             device_map="auto" if device is None else device, 
                                             attn_implementation="flash_attention_2") # if flash enabled
sys_prompt = '한국어로 대답해'
texts = ['안녕', '서울은 오늘 어때']
chats = list(map(lambda t: [{'role': 'user', 'content': f'{sys_prompt}\n{t}'}], texts)) # ChatML format
prompts = list(map(lambda p: tokenizer.apply_chat_template(p, tokenize=False, add_generation_prompt=True), chats))
input_ids = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda" if device is None else device)
outputs = model.generate(**input_ids, max_new_tokens=100, repetition_penalty=1.05)
for output in outputs:
  print(tokenizer.decode(output, skip_special_tokens=True), end='\n\n')

Training Details

Trained on 2 x RTX3090

More Information on Github source code

Training Data

[More Information Needed]

Training Procedure

  1. Weight Initialized from Internet comments dataset
  2. Trained on Korean Namuwiki dataset until step 80000 (30000 step is on main branch because of repetition issue above there)
  • seq_length 1024 with dataset packing
  • batch 3 per device
  • lr 1e-5
  • optim adafactor
  1. Instruction tuning on Korean Instruction Dataset using QLoRa (not on main)
  • seq_length 2048
  • lr 2e-4

Preprocessing [optional]

Gemma do not support explicit system prompt in ChatML, so I trained putting system prompt before user message like below

if (chat[0]['role'] == 'system'):
  chat[1]['content'] = f"{chat[0]['content']}\n{chat[1]['content']}"
  chat = chat[1:]
try:
  prompt = tokenizer.apply_chat_template(chat, tokenize=False)

Source Code

Training Hyperparameters

  • Training regime: [More Information Needed]

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Summary