metadata
language:
- ko
- en
library_name: transformers
tags:
- trl
- sft
widget:
- text: 안녕
Yokhal (욕쟁이 할머니)
Korean Chatbot based on Google Gemma
Model Details
Model Description
- Fine-tuned by: Alan Jo
- Model type: Gemma
- Language(s) (NLP): Korean, English
- Finetuned from model: Gemma-2b-it
Model Sources
Uses
Direct Use
Korean Chatbot with Internet culture
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
Use the code below to get started with the model.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
device_map="auto" if device is None else device,
attn_implementation="flash_attention_2") # if flash enabled
sys_prompt = '한국어로 대답해'
texts = ['안녕', '서울은 오늘 어때']
chats = list(map(lambda t: [{'role': 'user', 'content': f'{sys_prompt}\n{t}'}], texts)) # ChatML format
prompts = list(map(lambda p: tokenizer.apply_chat_template(p, tokenize=False, add_generation_prompt=True), chats))
input_ids = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda" if device is None else device)
outputs = model.generate(**input_ids, max_new_tokens=100, repetition_penalty=1.05)
for output in outputs:
print(tokenizer.decode(output, skip_special_tokens=True), end='\n\n')
Training Details
Trained on 2 x RTX3090
More Information on Github source code
Training Data
[More Information Needed]
Training Procedure
- Weight Initialized from Internet comments dataset
- Trained on Korean Namuwiki dataset until step 80000 (30000 step is on main branch because of repetition issue above there)
seq_length
1024 with dataset packingbatch
3 per devicelr
1e-5optim
adafactor
- Instruction tuning on Korean Instruction Dataset using QLoRa (not on main)
seq_length
2048lr
2e-4
Preprocessing [optional]
Gemma do not support explicit system prompt in ChatML, so I trained putting system prompt before user message like below
if (chat[0]['role'] == 'system'):
chat[1]['content'] = f"{chat[0]['content']}\n{chat[1]['content']}"
chat = chat[1:]
try:
prompt = tokenizer.apply_chat_template(chat, tokenize=False)
Training Hyperparameters
- Training regime: [More Information Needed]
Speeds, Sizes, Times [optional]
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
[More Information Needed]