File size: 2,933 Bytes

# Training

This is the 10k steps English supervised-fine-tuning (SFT) model of GPT-J using SODA dataset for Chai Competition.

- **Language:** English
- **Finetuned from:** [EleutherAI / GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b)
- **Code:** [Open-Assistant/model/model_training](https://github.com/LAION-AI/Open-Assistant/tree/main/model/model_training)
- **Dataset:** 10 percent from [SODA dataset](https://huggingface.co/datasets/allenai/soda)

# Why OpenAssistant framework:
- Easy to setup training with change config from dataset and model is all you need
- Data processing available for almost popular conversation datasets: SODA, Vicuna, OpenAssistant, ...

# Configuration:

You need to add this to default config file `configs/config.yaml`

data:
```
soda-only:
  datasets:
    - soda:
        fraction: 0.1
        input_max_length: 1024
```

gptj-chai:
```
  dtype: fp16
  log_dir: gptj-soda
  model_name: EleutherAI/gpt-j-6b
  output_dir: output/gptj-soda-chai
  max_length: 1024
  warmup_steps: 100
  gradient_checkpointing: true
  gradient_accumulation_steps: 1
  per_device_train_batch_size: 8
  per_device_eval_batch_size: 8
  eval_steps: 5000
  save_steps: 5000
  num_train_epochs: 1
  save_total_limit: 1
  use_flash_attention: false
```

# Command to train:

```bash
deepspeed trainer_sft.py --local_rank=0 --configs defaults gptj-chai soda-only --cache_dir data_cache --deepspeed
```

# Demo code:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM



class ChatBot():
    def __init__(self, path="/mnt/hdd/duyphung/gptj-soda-chai/checkpoint-10000/"):
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        self.model = AutoModelForCausalLM.from_pretrained(path).half().cuda().eval()
        self.model.pad_token_id = self.tokenizer.eos_token_id
        self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

    def chat(self, message):
        enc_dict = self.tokenizer(
            message,
            return_tensors='pt'
        )
        for x in enc_dict:
            enc_dict[x] = enc_dict[x].cuda()
        chat_history_ids = self.model.generate(
            input_ids=enc_dict['input_ids'],
            attention_mask=enc_dict['attention_mask'],
            max_new_tokens=64,
            temperature=0.7,
            do_sample=True,
            top_k=0,
            top_p=0.95,
        )
        out = chat_history_ids[:, enc_dict['input_ids'].shape[-1]:][0]
        return self.tokenizer.decode(out, skip_special_tokens=True)


if __name__ == "__main__":
    bot_name = 'Bot:'
    prompt = "<|prompter|>"
    chat_history = []

    bot = ChatBot()
    while True:
        message = input("Me: ")
        chat_history.append(f'Me: {message}')
        prompt = prompt + message + "<|endoftext|><|assistant|>"
        response = bot.chat(prompt)
        print(f'{bot_name} {response}')
        prompt = prompt + response + "<|endoftext|><|prompter|>"
```