# Training

This is a GPT-J model after 10k steps of English supervised fine-tuning (SFT) on the SODA dataset, trained for the Chai Competition.

- **Language:** English
- **Finetuned from:** [EleutherAI / GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b)
- **Code:** [Open-Assistant/model/model_training](https://github.com/LAION-AI/Open-Assistant/tree/main/model/model_training)
- **Dataset:** 10 percent of the [SODA dataset](https://huggingface.co/datasets/allenai/soda)

# Why the OpenAssistant framework:
- Training is easy to set up: changing the dataset and model entries in the config is all you need.
- Data processing is already available for most popular conversation datasets: SODA, Vicuna, OpenAssistant, ...

# Configuration:

Add the following sections to the default config file `configs/config.yaml`:


```yaml
data:
soda-only:
  datasets:
    - soda:
        fraction: 0.1
        input_max_length: 1024
```


```yaml
gptj-chai:
  dtype: fp16
  log_dir: gptj-soda
  model_name: EleutherAI/gpt-j-6b
  output_dir: output/gptj-soda-chai
  max_length: 1024
  warmup_steps: 100
  gradient_checkpointing: true
  gradient_accumulation_steps: 1
  per_device_train_batch_size: 8
  per_device_eval_batch_size: 8
  eval_steps: 5000
  save_steps: 5000
  num_train_epochs: 1
  save_total_limit: 1
  use_flash_attention: false
```
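
The batch settings in `gptj-chai` determine how many training examples a checkpoint has consumed at a given step. A minimal sketch of that arithmetic follows; the helper name `samples_seen` is hypothetical, and `num_gpus=1` is an assumption for illustration only, since the number of GPUs used is not stated here.

```python
def samples_seen(steps, per_device_batch_size, grad_accum_steps, num_gpus):
    """Training examples consumed after `steps` optimizer steps."""
    return steps * per_device_batch_size * grad_accum_steps * num_gpus


# Batch size and accumulation come from the gptj-chai section above.
# num_gpus=1 is an assumed value, not taken from the model card.
print(samples_seen(steps=10_000, per_device_batch_size=8,
                   grad_accum_steps=1, num_gpus=1))  # prints 80000
```

With more GPUs the count scales linearly, so the same 10k steps would cover proportionally more of the SODA subset.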

# Command to train:

```bash
deepspeed trainer_sft.py --local_rank=0 --configs defaults gptj-chai soda-only --cache_dir data_cache --deepspeed
```
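
The `--configs defaults gptj-chai soda-only` argument selects named sections of the config file. The sketch below illustrates one plausible way such sections combine, assuming they are layered in the order given so that later sections override earlier ones. The `merge_configs` helper and the `defaults` values are hypothetical placeholders; the actual logic lives in `trainer_sft.py` and may differ.

```python
def merge_configs(sections, names):
    """Layer the named sections in order; later names override earlier ones."""
    merged = {}
    for name in names:
        if name not in sections:
            raise KeyError(f"unknown config section: {name}")
        merged.update(sections[name])
    return merged


# Trimmed stand-in for the parsed configs/config.yaml.
# The "defaults" values are placeholders, not the framework's real defaults.
sections = {
    "defaults": {"dtype": "fp16", "max_length": 512, "num_train_epochs": 3},
    "gptj-chai": {
        "model_name": "EleutherAI/gpt-j-6b",
        "max_length": 1024,
        "num_train_epochs": 1,
    },
    "soda-only": {
        "datasets": [{"soda": {"fraction": 0.1, "input_max_length": 1024}}],
    },
}

config = merge_configs(sections, ["defaults", "gptj-chai", "soda-only"])
print(config["max_length"])  # prints 1024
```

Under this assumption, keys set in `gptj-chai` take precedence over `defaults`, and `soda-only` contributes only the dataset selection.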

# Interactive Demo Code:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM


class ChatBot:
    def __init__(self, path="/mnt/hdd/duyphung/gptj-soda-chai/checkpoint-10000/"):
        # `path` points at the fine-tuned checkpoint directory.
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # Load the model in fp16 on the GPU and switch to inference mode.
        self.model = AutoModelForCausalLM.from_pretrained(path).half().cuda().eval()
        # Reuse EOS as the padding token for both the tokenizer and the model config.
        self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
        self.model.config.pad_token_id = self.tokenizer.eos_token_id

    def chat(self, prompt):
        # Tokenize the whole conversation so far and move the tensors to the model's device.
        enc_dict = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(
            input_ids=enc_dict["input_ids"],
            attention_mask=enc_dict["attention_mask"],
            max_new_tokens=64,
            temperature=0.7,
            do_sample=True,
            top_k=0,  # disable top-k filtering; rely on nucleus (top-p) sampling
            top_p=0.95,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        # Drop the prompt tokens and decode only the newly generated reply.
        out = output_ids[0, enc_dict["input_ids"].shape[-1]:]
        return self.tokenizer.decode(out, skip_special_tokens=True)


if __name__ == "__main__":
    bot_name = "Bot:"
    # The running prompt alternates <|prompter|> and <|assistant|> turns,
    # each terminated by <|endoftext|>.
    prompt = "<|prompter|>"

    bot = ChatBot()
    while True:
        message = input("Me: ")
        prompt = prompt + message + "<|endoftext|><|assistant|>"
        response = bot.chat(prompt)
        print(f"{bot_name} {response}")
        prompt = prompt + response + "<|endoftext|><|prompter|>"
```
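
The demo builds its prompt by string concatenation inside the loop. The same format can be sketched as a standalone function that takes the conversation as a list of turns. The `build_prompt` helper below is hypothetical and not part of the training code; it only reproduces the `<|prompter|>` / `<|assistant|>` / `<|endoftext|>` layout used in the demo.

```python
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
EOS = "<|endoftext|>"


def build_prompt(turns):
    """Format alternating user/bot turns, ending with an open assistant tag.

    `turns` starts with a user message and must have odd length, so the
    last entry is the user message that is awaiting a reply.
    """
    if len(turns) % 2 == 0:
        raise ValueError("turns must end with a user message")
    parts = []
    for i, text in enumerate(turns):
        # Even positions are user turns, odd positions are bot turns.
        role = PROMPTER if i % 2 == 0 else ASSISTANT
        parts.append(f"{role}{text}{EOS}")
    # The trailing assistant tag tells the model to generate the next reply.
    return "".join(parts) + ASSISTANT


print(build_prompt(["Hi!", "Hello, how are you?", "Fine, thanks."]))
```

Keeping the history as a list makes it straightforward to drop the oldest turns when the prompt approaches the 1024-token `max_length` used in training.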