nisten committed
Commit e7bfb1f • 1 Parent(s): 0b5e19d

Update README.md

Files changed (1)
  1. README.md +307 -1
README.md CHANGED
@@ -36,4 +36,310 @@ Amazing option for further training. And this is a merge of the base, not the in

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/UK0_mQxy6GOHKxGKBbdhx.png)

I don't understand how the f a 150 MB file can talk, but it can.

## 🧠 What's Really Going Down Here?

We're talking about a convergence of a whole bunch of stuff; more papers will be written about this:

1. **Evolutionary Merging**
2. **BitNet Integration**
3. **Experimental GrokAdamW Optimizer** (see the sketch just below)
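
For the optimizer specifically, here's a compressed reading of what the implementation further down in this README actually does (a sketch in my own notation, not a formal writeup): each parameter keeps an extra EMA of its gradients, the EMA decay is modulated by a "grokking signal" (here, simply the current training loss), and the EMA gets folded back into an otherwise AdamW-style update:

$$
\alpha_t = \alpha_{init} \cdot e^{-\kappa \cdot s_t}, \qquad e_t = \alpha_t \cdot e_{t-1} + (1 - \alpha_t) \cdot g_t, \qquad \tilde{g}_t = g_t + \lambda \cdot e_t
$$

where $g_t$ is the (clipped) gradient, $s_t$ is the grokking signal, and $\tilde{g}_t$ is what feeds the usual first/second moment estimates. Here $\alpha_{init}$, $\lambda$, and $\kappa$ correspond to `alpha_init`, `lamb`, and `grokking_signal_decay_rate` in the code below.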

## Acknowledgements

Credits for the optimizer go to [@cognitivecompai](https://github.com/cognitivecomputations/grokadamw) for laying the groundwork with the original GrokAdamW optimizer.

## LET'S TRY OUT THE EXPERIMENTAL GROKKED FINETUNE:

```bash
wget https://huggingface.co/nisten/Biggie-SmoLlm-0.15B-Base/resolve/main/biggie_groked_int8_q8_0.gguf
```
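
If you don't already have `llama-cli` on your machine, one way to get it (assuming a standard CMake build of llama.cpp; adjust to your platform) is:

```bash
# Build llama.cpp from source; the llama-cli binary lands in build/bin/
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```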

Yes, we will be talking with a 164 MB file that runs at 160 tokens per second on a single CPU core.
## You read all of that correctly: yes, 1 CPU core, 160 tok/s https://x.com/nisten/status/1819752034305970649
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/nTNISjByBkN7bJZzuOvOw.png)

## 🚀 Run it with NO GPU and only one CPU core with these settings
```bash
./llama-cli -n -1 -fa -b 512 -ctv q8_0 -ctk q8_0 --min-p 0.3 --top-p 0.85 --keep -1 -p "You are a NASA JPL scientist. Human: I want to bring my cat to Mars." -m biggie_groked_int8_q8_0.gguf -co -cnv --in-prefix "<|im_start|>Human:" --reverse-prompt "Human:" -c 1024 -n 512 --temp 1.5 -ngl 0
```
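
Note that `-ngl 0` only disables GPU offload; it doesn't cap the thread count by itself. If you want to hold it to one core (assuming Linux, `taskset`, and llama.cpp's standard `-t/--threads` flag; the shortened invocation here is just for illustration), you can run:

```bash
# Pin the process to CPU core 0 and use a single compute thread
taskset -c 0 ./llama-cli -t 1 -ngl 0 -m biggie_groked_int8_q8_0.gguf \
  -p "You are a NASA JPL scientist. Human: I want to bring my cat to Mars." -n 256
```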

## 🏋️ Training Tutorial, MAKE YOUR OWN BIGGIE_SMOLLM

Clone the repo like you're stealing code from the future:
```bash
git clone https://github.com/nisten/grokadamw
cd grokadamw
```

Fire up the training script and watch the magic happen:
```bash
python smoltrainer.py
```

## 💻 Do it from scratch yourself
Install the secret sauce (dependencies):
```bash
pip install torch transformers datasets tqdm
```

Make a file named `meow.py`, copy-paste in this code, and then run it with `python meow.py`:

```python
import torch
import torch.nn as nn
import logging
from datasets import load_dataset, Dataset
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from torch.cuda.amp import autocast
import warnings
from tqdm import tqdm

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

MODEL_NAME = "nisten/Biggie-SmoLlm-0.15B-Base"
MAX_LENGTH = 2048
BATCH_SIZE = 12
LEARNING_RATE = 2e-4
MAX_STEPS = 3000
GRADIENT_ACCUMULATION_STEPS = 2
NUM_WARMUP_STEPS = 30
OUTPUT_DIR = "./capybara_finetuned_results"

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

class GrokAdamW(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2,
                 alpha_init=0.98, lamb=2.0, gamma=0.1, grokking_signal_fns=None,
                 grokking_signal_decay_rate=0.1, gradient_clipping=1.0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay,
                        alpha_init=alpha_init, lamb=lamb, gamma=gamma,
                        grokking_signal_fns=grokking_signal_fns,
                        grokking_signal_decay_rate=grokking_signal_decay_rate,
                        gradient_clipping=gradient_clipping)
        super(GrokAdamW, self).__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            grokking_signal = self._compute_grokking_signal(group)
            for i, p in enumerate(group['params']):
                if p.grad is None:
                    continue
                grad = p.grad

                if group['gradient_clipping'] > 0:
                    grad = torch.clamp(grad, -group['gradient_clipping'], group['gradient_clipping'])

                state = self.state[p]

                if len(state) == 0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                    state['grok_ema'] = torch.zeros_like(p, memory_format=torch.preserve_format)

                exp_avg, exp_avg_sq, grok_ema = state['exp_avg'], state['exp_avg_sq'], state['grok_ema']
                beta1, beta2 = group['betas']

                state['step'] += 1

                layer_beta1 = beta1 * (1 - group['gamma'])**i

                alpha = group['alpha_init'] * torch.exp(torch.tensor(-group['grokking_signal_decay_rate'] * grokking_signal))
                grok_ema.mul_(alpha).add_(grad, alpha=1 - alpha)
                grok_grad = grad + group['lamb'] * grok_ema

                exp_avg.mul_(layer_beta1).add_(grok_grad, alpha=1 - layer_beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grok_grad, grok_grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group['eps'])
                step_size = group['lr']

                if group['weight_decay'] != 0:
                    p.data.mul_(1 - group['lr'] * group['weight_decay'])

                p.addcdiv_(exp_avg, denom, value=-step_size)

        return loss

    def _compute_grokking_signal(self, group):
        if group['grokking_signal_fns'] is None:
            return 0.0

        signals = []
        for fn in group['grokking_signal_fns']:
            try:
                signal = fn()
                if signal is not None:
                    signals.append(signal)
            except Exception as e:
                logger.warning(f"Error in grokking_signal_fn: {e}. Ignoring this function.")

        if not signals:
            return 0.0

        return sum(signals) / len(signals)

def format_capybara_prompts(examples):
    texts = []
    for conversation in examples['conversation']:
        formatted_text = ""
        for turn in conversation:
            if 'input' in turn:
                formatted_text += f"Human: {turn['input']}\n\n"
            if 'output' in turn:
                formatted_text += f"Assistant: {turn['output']}\n\n"
        texts.append(formatted_text.strip())
    return {"text": texts}

class CustomTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.grokking_signal = 0.0

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
        return (loss, outputs) if return_outputs else loss

    def training_step(self, model, inputs):
        model.train()
        inputs = self._prepare_inputs(inputs)

        with autocast(dtype=torch.bfloat16):
            loss = self.compute_loss(model, inputs)

        if self.args.gradient_accumulation_steps > 1:
            loss = loss / self.args.gradient_accumulation_steps

        loss.backward()

        self.grokking_signal = loss.item()

        return loss.detach()

def grokking_signal_fn():
    return trainer.grokking_signal

def main():
    logger.info(f"🚀 Initializing {MODEL_NAME} finetuning with GrokAdamW")

    try:
        config = AutoConfig.from_pretrained(MODEL_NAME)
        tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
    except Exception as e:
        logger.error(f"❌ Failed to load model or tokenizer: {str(e)}")
        return

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = model.config.eos_token_id

    logger.info("📚 Loading Capybara dataset")
    try:
        capybara_dataset = load_dataset("LDJnr/Capybara", split="train")
        capybara_dataset = capybara_dataset.map(format_capybara_prompts, batched=True, remove_columns=capybara_dataset.column_names)
    except Exception as e:
        logger.error(f"❌ Failed to load Capybara dataset: {str(e)}")
        return

    logger.info(f"📊 Capybara dataset size: {len(capybara_dataset)}")

    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=MAX_LENGTH)

    logger.info("🔢 Tokenizing dataset")
    tokenized_dataset = capybara_dataset.map(tokenize_function, batched=True, remove_columns=capybara_dataset.column_names)

    logger.info("🏋️ Setting up the training arguments")
    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        num_train_epochs=3,
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        learning_rate=LEARNING_RATE,
        weight_decay=0.01,
        bf16=True,
        logging_steps=10,
        save_steps=300,
        save_total_limit=10,
        dataloader_num_workers=4,
        warmup_steps=NUM_WARMUP_STEPS,
        gradient_checkpointing=True,
        evaluation_strategy="steps",
        eval_steps=300,
        max_steps=MAX_STEPS,
        fp16=False,
        optim="adamw_hf",
        lr_scheduler_type="cosine",
        load_best_model_at_end=True,
        metric_for_best_model="loss",
        greater_is_better=False,
    )

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    optimizer = GrokAdamW(
        model.parameters(),
        lr=LEARNING_RATE,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0.01,
        alpha_init=0.98,
        lamb=2.0,
        gamma=0.1,
        grokking_signal_fns=[grokking_signal_fn],
        grokking_signal_decay_rate=0.1,
        gradient_clipping=1.0
    )

    logger.info("🏃‍♂️ Initializing Trainer with GrokAdamW")
    global trainer
    trainer = CustomTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        eval_dataset=tokenized_dataset.select(range(min(1000, len(tokenized_dataset)))),
        data_collator=data_collator,
        optimizers=(optimizer, None),
    )

    logger.info("🔥 Starting the training with GrokAdamW")
    try:
        trainer.train()
    except Exception as e:
        logger.error(f"❌ Training failed: {str(e)}")
        return

    logger.info("💾 Saving the model")
    try:
        trainer.save_model(OUTPUT_DIR)
    except Exception as e:
        logger.error(f"❌ Failed to save model: {str(e)}")

    logger.info("🎉 Finetuning with GrokAdamW completed!")

if __name__ == "__main__":
    main()
```
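
Once training finishes, a quick sanity check (a minimal sketch, assuming the default `OUTPUT_DIR` above; the tokenizer is reloaded from the base repo since the script doesn't pass it to the Trainer, so it isn't saved alongside the model) looks like this:

```python
# Load the finetuned checkpoint written by meow.py and generate one reply
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nisten/Biggie-SmoLlm-0.15B-Base")
model = AutoModelForCausalLM.from_pretrained("./capybara_finetuned_results", torch_dtype=torch.bfloat16)

# Same Human/Assistant layout the training data was flattened into
prompt = "Human: I want to bring my cat to Mars.\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```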
Now go forth and train, and accelerate that code (you'll need about 22 GB of VRAM; change to batch size 1 for 8 GB) 🧪🚀
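
If you're VRAM-limited, one rough way to fit a ~8 GB card while keeping the effective batch size in the same ballpark (my numbers, not settings tested by the author) is to edit the constants at the top of `meow.py`:

```python
# Hypothetical low-VRAM settings: per-device batch of 1, more accumulation steps
BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 24  # effective batch stays near the original 12 * 2 = 24
```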