gptj-soda-chai

Training

This is a 10k-step English supervised fine-tuning (SFT) checkpoint of GPT-J, trained on the SODA dataset for the Chai Competition.

Why the OpenAssistant framework:

  • Easy to set up: changing the dataset and model entries in the config is all you need
  • Data processing is available for most popular conversation datasets: SODA, Vicuna, OpenAssistant, ...

Configuration:

Add the following to the default config file configs/config.yaml:

soda-only:
  datasets:
    - soda:
        fraction: 0.1
        input_max_length: 1024

gptj-chai:
  dtype: fp16
  log_dir: gptj-soda
  model_name: EleutherAI/gpt-j-6b
  output_dir: output/gptj-soda-chai
  max_length: 1024
  warmup_steps: 100
  gradient_checkpointing: true
  gradient_accumulation_steps: 1
  per_device_train_batch_size: 8
  per_device_eval_batch_size: 8
  eval_steps: 5000
  save_steps: 5000
  num_train_epochs: 1
  save_total_limit: 1
  use_flash_attention: false

Command to train:

deepspeed trainer_sft.py --local_rank=0 --configs defaults gptj-chai soda-only --cache_dir data_cache --deepspeed
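
Each name passed to --configs selects a top-level section of configs/config.yaml, applied left to right so that later sections override earlier keys. Below is a minimal sketch of that composition; the merge order is an assumption, and the real logic lives inside the Open-Assistant trainer_sft.py.

# Hypothetical sketch: compose the named sections of configs/config.yaml
# the way "--configs defaults gptj-chai soda-only" implies. The actual
# merge is performed inside trainer_sft.py.
import yaml

with open("configs/config.yaml") as f:
    conf = yaml.safe_load(f)

merged = {}
for name in ["defaults", "gptj-chai", "soda-only"]:
    merged.update(conf[name])  # later sections override earlier keys

print(merged["model_name"])  # EleutherAI/gpt-j-6b
print(merged["datasets"])    # [{'soda': {'fraction': 0.1, 'input_max_length': 1024}}]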

Demo code:

from transformers import AutoTokenizer, AutoModelForCausalLM



class ChatBot():
    def __init__(self, path="/mnt/hdd/duyphung/gptj-soda-chai/checkpoint-10000/"):
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # Load the fine-tuned checkpoint in fp16 on GPU for inference.
        self.model = AutoModelForCausalLM.from_pretrained(path).half().cuda().eval()
        # GPT-J has no pad token; reuse the EOS token for padding.
        self.model.config.pad_token_id = self.tokenizer.eos_token_id
        self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

    def chat(self, message):
        enc_dict = self.tokenizer(message, return_tensors='pt')
        # Move the input tensors to the same device as the model.
        for x in enc_dict:
            enc_dict[x] = enc_dict[x].cuda()
        # Nucleus sampling: top_k=0 disables top-k filtering, so only top_p applies.
        chat_history_ids = self.model.generate(
            input_ids=enc_dict['input_ids'],
            attention_mask=enc_dict['attention_mask'],
            max_new_tokens=64,
            temperature=0.7,
            do_sample=True,
            top_k=0,
            top_p=0.95,
        )
        # Strip the prompt tokens, keeping only the newly generated reply.
        out = chat_history_ids[:, enc_dict['input_ids'].shape[-1]:][0]
        return self.tokenizer.decode(out, skip_special_tokens=True)


if __name__ == "__main__":
    bot_name = 'Bot:'
    # Open-Assistant prompt format: <|prompter|>...<|endoftext|><|assistant|>...
    prompt = "<|prompter|>"
    chat_history = []

    bot = ChatBot()
    while True:
        message = input("Me: ")
        chat_history.append(f'Me: {message}')
        # Close the user turn and open an assistant turn.
        prompt = prompt + message + "<|endoftext|><|assistant|>"
        response = bot.chat(prompt)
        chat_history.append(f'{bot_name} {response}')
        print(f'{bot_name} {response}')
        # Close the assistant turn and open the next user turn.
        prompt = prompt + response + "<|endoftext|><|prompter|>"
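
A hypothetical session (replies are sampled with temperature 0.7, so actual output will vary):

Me: Hi! How was your day?
Bot: It was great, I went hiking with my friends. How about you?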