# Training
This is an English supervised fine-tuning (SFT) model of GPT-J, trained for 10k steps on the SODA dataset for the Chai Competition.

- **Language:** English
- **Finetuned from:** [EleutherAI / GPT-J](https://huggingface.co/EleutherAI/gpt-j-6b)
- **Code:** [Open-Assistant/model/model_training](https://github.com/LAION-AI/Open-Assistant/tree/main/model/model_training)
- **Dataset:** 10 percent of the [SODA dataset](https://huggingface.co/datasets/allenai/soda)

# Why the OpenAssistant framework:
- Training is easy to set up: changing the dataset and model entries in the config is all you need.
- Data processing is already available for most popular conversation datasets: SODA, Vicuna, OpenAssistant, ...

# Configuration:
Add the following entries to the default config file `configs/config.yaml`.

data:
```
soda-only:
  datasets:
    - soda:
        fraction: 0.1
        input_max_length: 1024
```
gptj-chai:
```
gptj-chai:
  dtype: fp16
  log_dir: gptj-soda
  model_name: EleutherAI/gpt-j-6b
  output_dir: output/gptj-soda-chai
  max_length: 1024
  warmup_steps: 100
  gradient_checkpointing: true
  gradient_accumulation_steps: 1
  per_device_train_batch_size: 8
  per_device_eval_batch_size: 8
  eval_steps: 5000
  save_steps: 5000
  num_train_epochs: 1
  save_total_limit: 1
  use_flash_attention: false
```

# Command to train:
```bash
deepspeed trainer_sft.py --local_rank=0 --configs defaults gptj-chai soda-only --cache_dir data_cache --deepspeed
```

# Demo code:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM


class ChatBot():
    def __init__(self, path="/mnt/hdd/duyphung/gptj-soda-chai/checkpoint-10000/"):
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # Load the checkpoint in fp16 on the GPU, in inference mode.
        self.model = AutoModelForCausalLM.from_pretrained(path).half().cuda().eval()
        # Reuse EOS as the padding token for both the model and the tokenizer.
        self.model.config.pad_token_id = self.tokenizer.eos_token_id
        self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

    def chat(self, message):
        # Tokenize the full prompt and move the tensors to the model's device.
        enc_dict = self.tokenizer(message, return_tensors='pt').to(self.model.device)
        chat_history_ids = self.model.generate(
            input_ids=enc_dict['input_ids'],
            attention_mask=enc_dict['attention_mask'],
            max_new_tokens=64,
            temperature=0.7,
            do_sample=True,
            top_k=0,
            top_p=0.95,
            pad_token_id=self.tokenizer.eos_token_id,
        )
        # Keep only the newly generated tokens (drop the prompt).
        out = chat_history_ids[:, enc_dict['input_ids'].shape[-1]:][0]
        return self.tokenizer.decode(out, skip_special_tokens=True)


if __name__ == "__main__":
    bot_name = 'Bot:'
    prompt = "<|prompter|>"
    chat_history = []  # plain-text transcript of the conversation
    bot = ChatBot()
    while True:
        message = input("Me: ")
        chat_history.append(f'Me: {message}')
        # Close the user turn and open the assistant turn.
        prompt = prompt + message + "<|endoftext|><|assistant|>"
        response = bot.chat(prompt)
        print(f'{bot_name} {response}')
        chat_history.append(f'{bot_name} {response}')
        # Close the assistant turn and open the next user turn.
        prompt = prompt + response + "<|endoftext|><|prompter|>"
```
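
The demo above builds its prompt by string concatenation, so the prompt grows with every turn. The following is a minimal sketch of the same prompt format as a pure function, plus an optional trimming step. `build_prompt` and `trim_turns` are hypothetical helpers, not part of Open-Assistant or Transformers. The default budget of 960 tokens is an assumption: `max_length: 1024` from the training config minus `max_new_tokens=64` from the demo. GPT-J itself accepts longer inputs, so adjust it as needed.

```python
from typing import Callable, List, Tuple

# Special tokens exactly as they appear in the demo code.
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
EOS = "<|endoftext|>"

Turn = Tuple[str, str]  # (user message, bot response)


def build_prompt(turns: List[Turn], message: str) -> str:
    """Return the prompt for `message`, preceded by earlier (user, bot) turns."""
    prompt = ""
    for user_text, bot_text in turns:
        prompt += f"{PROMPTER}{user_text}{EOS}{ASSISTANT}{bot_text}{EOS}"
    # Leave the assistant turn open so the model generates the reply.
    return f"{prompt}{PROMPTER}{message}{EOS}{ASSISTANT}"


def trim_turns(
    turns: List[Turn],
    message: str,
    count_tokens: Callable[[str], int],
    budget: int = 1024 - 64,  # assumption: training max_length minus max_new_tokens
) -> List[Turn]:
    """Drop the oldest turns until the prompt fits within `budget` tokens."""
    kept = list(turns)  # copy, so the caller's history is not modified
    while kept and count_tokens(build_prompt(kept, message)) > budget:
        kept.pop(0)
    return kept


if __name__ == "__main__":
    print(build_prompt([], "Hello"))
    # <|prompter|>Hello<|endoftext|><|assistant|>
```

With the `ChatBot` class from the demo, `count_tokens` could be something like `lambda s: len(bot.tokenizer(s)["input_ids"])`, and the returned string can be passed directly to `bot.chat`.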