ayoolaolafenwa committed on
Commit
ec6bf39
1 Parent(s): 76b9366

Update README.md

  1. README.md +120 -0
README.md CHANGED
@@ -1,3 +1,123 @@
  ---
  license: apache-2.0
  ---
+ ## ChatLM
+ ChatLM is a chat large language model finetuned from the pretrained [Falcon-1B model](https://huggingface.co/tiiuae/falcon-rw-1b)
+ on the [chat-bot-instructions prompts dataset](https://huggingface.co/datasets/ayoolaolafenwa/sft-data).
+ The training data consists of ordinary day-to-day human conversations. Because this dataset is limited,
+ ChatLM is not suitable for tasks such as coding or questions about current affairs.
+
+ ## Load Model in bfloat16
+ ``` python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_path = "ayoolaolafenwa/ChatLM"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+ # Load the model weights in bfloat16
+ model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
+                                              torch_dtype=torch.bfloat16)
+
+ # The prompt uses the <user>/<chatbot> tags from the finetuning data
+ prompt = "<user>: Give me financial advice on investing in stocks. <chatbot>: "
+
+ tokens = tokenizer(prompt, return_tensors="pt")
+
+ # Move the inputs to the same device as the model
+ token_ids = tokens.input_ids.to(model.device)
+ attention_mask = tokens.attention_mask.to(model.device)
+
+ # Sample a single response with top-k sampling
+ outputs = model.generate(input_ids=token_ids, attention_mask=attention_mask, max_length=2048, do_sample=True,
+                          num_return_sequences=1, top_k=10, temperature=0.7, eos_token_id=tokenizer.eos_token_id)
+
+ # Decode the generated ids and strip the end-of-text marker
+ output_text = tokenizer.decode(outputs[0])
+ output_text = output_text.replace("<|endoftext|>", "")
+
+ print(output_text)
+ ```
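
The `generate` call above samples with `top_k` and `temperature`. The sketch below illustrates roughly what those two parameters do to a next-token distribution, using only the standard library. It is not the `transformers` implementation; the function name `sample_top_k` and the toy logits are made up for illustration.

```python
import math
import random


def sample_top_k(logits, top_k=10, temperature=0.7, rng=random):
    """Sample one token index from raw logits using temperature and top-k."""
    # Keep only the indices of the top_k highest-scoring tokens
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # A temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights over the kept logits (subtract the max for stability)
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical scores for a 5-token vocabulary
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]

random.seed(0)
picks = [sample_top_k(toy_logits, top_k=2) for _ in range(1000)]

# Only the two highest-scoring tokens can ever be selected
print(sorted(set(picks)))
```

With `top_k=2`, tokens outside the two best candidates are never drawn, and the lower temperature makes the best candidate win most of the time.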
+
+ ## Load Model in bfloat16 and int8
+ ``` python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_path = "ayoolaolafenwa/ChatLM"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+ # 8-bit loading requires the bitsandbytes and accelerate packages
+ model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
+                                              torch_dtype=torch.bfloat16, load_in_8bit=True)
+
+ # The prompt uses the <user>/<chatbot> tags from the finetuning data
+ prompt = "<user>: Give me financial advice on investing in stocks. <chatbot>: "
+
+ tokens = tokenizer(prompt, return_tensors="pt")
+
+ # Move the inputs to the same device as the model
+ token_ids = tokens.input_ids.to(model.device)
+ attention_mask = tokens.attention_mask.to(model.device)
+
+ # Sample a single response with top-k sampling
+ outputs = model.generate(input_ids=token_ids, attention_mask=attention_mask, max_length=2048, do_sample=True,
+                          num_return_sequences=1, top_k=10, temperature=0.7, eos_token_id=tokenizer.eos_token_id)
+
+ # Decode the generated ids and strip the end-of-text marker
+ output_text = tokenizer.decode(outputs[0])
+ output_text = output_text.replace("<|endoftext|>", "")
+
+ print(output_text)
+ ```
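
Both loading examples build the prompt by hand and print the full decoded text, prompt included. The following is a minimal sketch of two helpers that wrap a message in the same `<user>: ... <chatbot>: ` format and pull out only the reply. The names `build_prompt` and `extract_reply` are hypothetical and not part of ChatLM or `transformers`, and the decoded string is invented for illustration.

```python
USER_TAG = "<user>: "
CHATBOT_TAG = "<chatbot>: "
EOS_TOKEN = "<|endoftext|>"


def build_prompt(message):
    """Wrap a user message in the tag format used in the examples above."""
    return f"{USER_TAG}{message.strip()} {CHATBOT_TAG}"


def extract_reply(output_text):
    """Return only the chatbot's reply from decoded generation output."""
    # Split on the first chatbot tag; fall back to the whole text if absent
    head, sep, tail = output_text.partition(CHATBOT_TAG.strip())
    reply = tail if sep else head
    return reply.replace(EOS_TOKEN, "").strip()


prompt = build_prompt("Give me financial advice on investing in stocks.")

# Invented model output, standing in for tokenizer.decode(outputs[0])
decoded = prompt + "Diversify your portfolio.<|endoftext|>"
print(extract_reply(decoded))
```

In the examples above, `build_prompt(...)` would replace the hard-coded `prompt` string and `extract_reply(output_text)` would replace the manual `replace` call.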
+ ## Training procedure for Supervised Finetuning
+
+ The Chatbot Instructions prompts dataset from https://huggingface.co/datasets/alespalla/chatbot_instruction_prompts/viewer/alespalla--chatbot_instruction_prompts
+ was processed into a supervised finetuning format, in which each example pairs a user prompt with its corresponding response.
+
+ ##### Download Data
+ ``` python
+ from datasets import load_dataset
+
+ # Download the training split from the Hugging Face Hub
+ dataset = load_dataset("alespalla/chatbot_instruction_prompts", split="train")
+
+ # Keep a local copy and export it to csv for processing
+ dataset.save_to_disk('ChatBotInsP')
+ dataset.to_csv('CIPtrain.csv')
+ ```
+
+ ##### Code to process dataset into Supervised finetuning format
+ ``` python
+ import os
+
+ import pandas as pd
+
+ # Read the downloaded dataset from the csv file
+ text_data = pd.read_csv("CIPtrain.csv")
+
+ # Create empty lists for prompts and responses
+ prompts = []
+ responses = []
+
+ # Loop through the rows of the dataset
+ for i in range(len(text_data)):
+     # Get the prompt and response of the current row as strings
+     prompt = str(text_data["prompt"][i])
+     response = str(text_data["response"][i])
+
+     # Add the prompt to the prompts list with the <user> tag
+     prompts.append("<user>: " + prompt)
+     # Add the response to the responses list with the <chatbot> tag
+     responses.append("<chatbot>: " + response)
+
+ # Create a new dataframe with prompts and responses columns
+ new_data = pd.DataFrame({"prompt": prompts, "response": responses})
+
+ # Source dataset: alespalla/chatbot_instruction_prompts
+ # Write the new dataframe to a csv file, creating the output folder if needed
+ os.makedirs("MyData", exist_ok=True)
+ new_data.to_csv("MyData/chatbot_instruction_prompts_train.csv", index=False)
+ ```
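
To make the effect of the tagging step concrete, here is a small standard-library sketch that applies the same prefixes to two in-memory rows. The sample rows are invented and the helper name `tag_pair` is hypothetical; the real pipeline reads `CIPtrain.csv` with pandas as shown above.

```python
import csv
import io

USER_TAG = "<user>: "
CHATBOT_TAG = "<chatbot>: "


def tag_pair(prompt, response):
    """Prefix a prompt/response pair with the role tags used above."""
    return {
        "prompt": USER_TAG + str(prompt),
        "response": CHATBOT_TAG + str(response),
    }


# Invented rows standing in for the contents of CIPtrain.csv
raw_csv = "prompt,response\nHello,Hi there!\nHow are you?,I am fine.\n"

rows = [
    tag_pair(record["prompt"], record["response"])
    for record in csv.DictReader(io.StringIO(raw_csv))
]

for row in rows:
    print(row["prompt"], "|", row["response"])
# <user>: Hello | <chatbot>: Hi there!
# <user>: How are you? | <chatbot>: I am fine.
```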
+ I prefixed each user prompt in the dataset with the tag <user> and each response with the tag <chatbot>.
+ See the modified dataset at https://huggingface.co/datasets/ayoolaolafenwa/sft-data .
+
+ ChatLM was finetuned from the pretrained [Falcon-1B model](https://huggingface.co/tiiuae/falcon-rw-1b) on the prepared supervised
+ dataset using a single H100 GPU. The full training code is available in its GitHub repository: https://github.com/ayoolaolafenwa/ChatLM/tree/main