IlyaGusev committed on
Commit
c80b7c0
1 Parent(s): cc1bf6a

Create README.md

Files changed (1)
  1. README.md +134 -0
README.md ADDED
@@ -0,0 +1,134 @@
+ ---
+ datasets:
+ - IlyaGusev/ru_turbo_alpaca
+ - IlyaGusev/ru_turbo_saiga
+ - IlyaGusev/ru_sharegpt_cleaned
+ - IlyaGusev/oasst1_ru_main_branch
+ - IlyaGusev/ru_turbo_alpaca_evol_instruct
+ - lksy/ru_instruct_gpt4
+ language:
+ - ru
+ pipeline_tag: conversational
+ license: cc-by-4.0
+ ---
+
+ # Saiga Mistral 7B, Russian Mistral-based chatbot
+
+ Based on [Mistral OpenOrca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca).
+
+ This is an adapter-only version.
+
+ Llama.cpp version: TBA
+
+ Colab: TBA
+
+ Training code: [link](https://github.com/IlyaGusev/rulm/tree/master/self_instruct).
+
+ ```python
+ import torch
+ from peft import PeftModel, PeftConfig
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
+
+ MODEL_NAME = "IlyaGusev/saiga_mistral_7b"
+ DEFAULT_MESSAGE_TEMPLATE = "<s>{role}\n{content}</s>"
+ DEFAULT_RESPONSE_TEMPLATE = "<s>bot\n"
+ # English: "You are Saiga, a Russian-language automatic assistant. You talk to people and help them."
+ DEFAULT_SYSTEM_PROMPT = "Ты — Сайга, русскоязычный автоматический ассистент. Ты разговариваешь с людьми и помогаешь им."
+
+ class Conversation:
+     def __init__(
+         self,
+         message_template=DEFAULT_MESSAGE_TEMPLATE,
+         system_prompt=DEFAULT_SYSTEM_PROMPT,
+         response_template=DEFAULT_RESPONSE_TEMPLATE,
+     ):
+         self.message_template = message_template
+         self.response_template = response_template
+         # The conversation always starts with the system message.
+         self.messages = [{
+             "role": "system",
+             "content": system_prompt
+         }]
+
+     def add_user_message(self, message):
+         self.messages.append({
+             "role": "user",
+             "content": message
+         })
+
+     def add_bot_message(self, message):
+         self.messages.append({
+             "role": "bot",
+             "content": message
+         })
+
+     def get_prompt(self, tokenizer):
+         # `tokenizer` is unused here; it is kept so existing callers keep working.
+         final_text = ""
+         for message in self.messages:
+             message_text = self.message_template.format(**message)
+             final_text += message_text
+         # Open the bot turn so the model continues with its answer.
+         final_text += self.response_template
+         return final_text.strip()
+
+
+ def generate(model, tokenizer, prompt, generation_config):
+     data = tokenizer(prompt, return_tensors="pt")
+     data = {k: v.to(model.device) for k, v in data.items()}
+     output_ids = model.generate(
+         **data,
+         generation_config=generation_config
+     )[0]
+     # Drop the prompt tokens and keep only the newly generated ones.
+     output_ids = output_ids[len(data["input_ids"][0]):]
+     output = tokenizer.decode(output_ids, skip_special_tokens=True)
+     return output.strip()
+
+ # Load the base model named in the adapter config, then attach the adapter.
+ # load_in_8bit=True requires the bitsandbytes package.
+ config = PeftConfig.from_pretrained(MODEL_NAME)
+ model = AutoModelForCausalLM.from_pretrained(
+     config.base_model_name_or_path,
+     load_in_8bit=True,
+     torch_dtype=torch.float16,
+     device_map="auto"
+ )
+ model = PeftModel.from_pretrained(
+     model,
+     MODEL_NAME,
+     torch_dtype=torch.float16
+ )
+ model.eval()
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
+ generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
+ print(generation_config)
+
+ # English: "Why is grass green?" and
+ # "Write a long story, making sure to mention the following objects. Given: Tanya, ball"
+ inputs = ["Почему трава зеленая?", "Сочини длинный рассказ, обязательно упоминая следующие объекты. Дано: Таня, мяч"]
+ for inp in inputs:
+     conversation = Conversation()
+     conversation.add_user_message(inp)
+     prompt = conversation.get_prompt(tokenizer)
+
+     output = generate(model, tokenizer, prompt, generation_config)
+     print(inp)
+     print(output)
+     print()
+     print("==============================")
+     print()
+ ```
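
To make the prompt layout concrete, here is a minimal sketch that renders a conversation with plain Python, using the `<s>{role}\n{content}</s>` message template and the `<s>bot\n` response template from the snippet above. It needs no model or tokenizer. `build_prompt` is a hypothetical helper introduced only for illustration, and the system prompt is a shortened English placeholder rather than the real one.

```python
MESSAGE_TEMPLATE = "<s>{role}\n{content}</s>"
RESPONSE_TEMPLATE = "<s>bot\n"


def build_prompt(messages):
    """Hypothetical helper: render a list of {"role", "content"} dicts."""
    text = "".join(MESSAGE_TEMPLATE.format(**m) for m in messages)
    # Open the bot turn so the model continues with its answer.
    # strip() mirrors the README code and removes the trailing newline.
    return (text + RESPONSE_TEMPLATE).strip()


messages = [
    # Placeholder system prompt (assumption, not the model's default).
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Почему трава зеленая?"},
]
prompt = build_prompt(messages)
print(prompt)
```

For a multi-turn conversation, earlier answers would be appended as `{"role": "bot", "content": ...}` entries before the next user message, so each turn is wrapped in its own `<s>...</s>` pair.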
+
+ Examples:
+
+ ```
+ User: Почему трава зеленая?
+ Saiga:
+ ```
+
+ ```
+ User: Сочини длинный рассказ, обязательно упоминая следующие объекты. Дано: Таня, мяч
+ Saiga:
+ ```
+
+ - dataset code revision d0d123dd221e10bb2a3383bcb1c6e4efe1b4a28a
+ - wandb [link](https://wandb.ai/ilyagusev/rulm_self_instruct/runs/ip1qmm9p)
+ - 5 datasets: ru_turbo_saiga, ru_sharegpt_cleaned, oasst1_ru_main_branch, gpt_roleplay_realm, ru_instruct_gpt4
+ - Datasets merging script: [create_short_chat_set.py](https://github.com/IlyaGusev/rulm/blob/d0d123dd221e10bb2a3383bcb1c6e4efe1b4a28a/self_instruct/src/data_processing/create_short_chat_set.py)
+ - saiga_mistral_7b vs saiga2_13b: 243-31-141
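
The README does not say what the three numbers in the last line mean. Assuming they are wins, ties, and losses for saiga_mistral_7b in a side-by-side comparison against saiga2_13b (an assumption, not confirmed by the source), the shares can be worked out with a short sketch like this one. All variable names are illustrative.

```python
# ASSUMPTION: "243-31-141" is wins-ties-losses for saiga_mistral_7b.
score = "243-31-141"
wins, ties, losses = (int(part) for part in score.split("-"))

total = wins + ties + losses
win_share = wins / total
tie_share = ties / total
loss_share = losses / total

print(f"comparisons: {total}")
print(f"wins {win_share:.1%}, ties {tie_share:.1%}, losses {loss_share:.1%}")
```

Under that reading, saiga_mistral_7b wins more comparisons than it loses; if the numbers mean something else, the arithmetic still holds but the interpretation does not.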