Update README.md
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/UK0_mQxy6GOHKxGKBbdhx.png)

I don't understand how the heck a 150 MB file can talk, but it can.
## 🧠 What's Really Going Down Here?

We're talking about a convergence of a whole bunch of techniques; more papers will be written about this:

1. **Evolutionary Merging**:
2. **BitNet Integration**:
3. **Experimental GrokAdamW Optimizer**:

## Acknowledgements

Credits for the optimizer go to [@cognitivecompai](https://github.com/cognitivecomputations/grokadamw) for laying the groundwork with the original GrokAdamW optimizer.

## LET'S TRY OUT THE EXPERIMENTAL GROKKED FINETUNE

```bash
wget https://huggingface.co/nisten/Biggie-SmoLlm-0.15B-Base/resolve/main/biggie_groked_int8_q8_0.gguf
```

Yes, we will be talking with a 164 MB file that runs at 160 tokens per second on a single CPU core.

## You read all of that correctly: yes, 1 CPU core, 160 tokens per second (https://x.com/nisten/status/1819752034305970649)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/nTNISjByBkN7bJZzuOvOw.png)

## 🚀 Run it with NO GPU and only one CPU core, with these settings

```bash
./llama-cli -n -1 -fa -b 512 -ctv q8_0 -ctk q8_0 --min-p 0.3 --top-p 0.85 --keep -1 -p "You are a NASA JPL Scientist. Human: I want to bring my cat to Mars." -m biggie_groked_int8_q8_0.gguf -co -cnv --in-prefix "<|im_start|>Human:" --reverse-prompt "Human:" -c 1024 -n 512 --temp 1.5 -ngl 0
```

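If you'd rather drive the same GGUF from Python instead of `llama-cli`, here's a minimal sketch using the `llama-cpp-python` bindings (not part of the original instructions; `pip install llama-cpp-python`, and a reasonably recent version for `min_p` support). The sampling values just mirror the flags above.

```python
# Sketch: chat with biggie_groked_int8_q8_0.gguf from Python via llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="biggie_groked_int8_q8_0.gguf",  # the file downloaded with wget above
    n_ctx=1024,      # matches -c 1024
    n_threads=1,     # one CPU core, as advertised
    n_gpu_layers=0,  # matches -ngl 0 (no GPU offload)
)

prompt = "You are a NASA JPL Scientist.\nHuman: I want to bring my cat to Mars.\nAssistant:"
out = llm(
    prompt,
    max_tokens=256,
    temperature=1.5,   # matches --temp 1.5
    top_p=0.85,        # matches --top-p 0.85
    min_p=0.3,         # matches --min-p 0.3
    stop=["Human:"],   # stop when the model starts a new "Human:" turn
)
print(out["choices"][0]["text"].strip())
```
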
## 🏋️ Training Tutorial, MAKE YOUR OWN BIGGIE_SMOLLM

Clone the repo like you're stealing code from the future:
```bash
git clone https://github.com/nisten/grokadamw
cd grokadamw
```

Fire up the training script and watch the magic happen:
```bash
python smoltrainer.py
```

## 💻 Do it from scratch yourself
Install the secret sauce (dependencies):
```bash
pip install torch transformers datasets tqdm
```

Make a file named `meow.py`, copy-paste in this code, and then run it with `python meow.py`:

```python
import torch
import torch.nn as nn
import logging
from datasets import load_dataset, Dataset
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from torch.cuda.amp import autocast
import warnings
from tqdm import tqdm

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

MODEL_NAME = "nisten/Biggie-SmoLlm-0.15B-Base"
MAX_LENGTH = 2048
BATCH_SIZE = 12
LEARNING_RATE = 2e-4
MAX_STEPS = 3000
GRADIENT_ACCUMULATION_STEPS = 2
NUM_WARMUP_STEPS = 30
OUTPUT_DIR = "./capybara_finetuned_results"

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

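# GrokAdamW, as implemented below: an AdamW-style update plus an extra EMA of the gradient
# ("grok_ema") that is folded back into the update, scaled by lamb. The EMA coefficient alpha
# decays with an externally supplied "grokking signal" (here, the running training loss), and
# beta1 is decayed per parameter index via gamma.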
class GrokAdamW(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2,
                 alpha_init=0.98, lamb=2.0, gamma=0.1, grokking_signal_fns=None,
                 grokking_signal_decay_rate=0.1, gradient_clipping=1.0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay,
                        alpha_init=alpha_init, lamb=lamb, gamma=gamma,
                        grokking_signal_fns=grokking_signal_fns,
                        grokking_signal_decay_rate=grokking_signal_decay_rate,
                        gradient_clipping=gradient_clipping)
        super(GrokAdamW, self).__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            grokking_signal = self._compute_grokking_signal(group)
            for i, p in enumerate(group['params']):
                if p.grad is None:
                    continue
                grad = p.grad

                if group['gradient_clipping'] > 0:
                    grad = torch.clamp(grad, -group['gradient_clipping'], group['gradient_clipping'])

                state = self.state[p]

                if len(state) == 0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                    state['grok_ema'] = torch.zeros_like(p, memory_format=torch.preserve_format)

                exp_avg, exp_avg_sq, grok_ema = state['exp_avg'], state['exp_avg_sq'], state['grok_ema']
                beta1, beta2 = group['betas']

                state['step'] += 1

                layer_beta1 = beta1 * (1 - group['gamma'])**i

                alpha = group['alpha_init'] * torch.exp(torch.tensor(-group['grokking_signal_decay_rate'] * grokking_signal))
                grok_ema.mul_(alpha).add_(grad, alpha=1 - alpha)
                grok_grad = grad + group['lamb'] * grok_ema

                exp_avg.mul_(layer_beta1).add_(grok_grad, alpha=1 - layer_beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grok_grad, grok_grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group['eps'])
                step_size = group['lr']

                if group['weight_decay'] != 0:
                    p.data.mul_(1 - group['lr'] * group['weight_decay'])

                p.addcdiv_(exp_avg, denom, value=-step_size)

        return loss

    def _compute_grokking_signal(self, group):
        if group['grokking_signal_fns'] is None:
            return 0.0

        signals = []
        for fn in group['grokking_signal_fns']:
            try:
                signal = fn()
                if signal is not None:
                    signals.append(signal)
            except Exception as e:
                logger.warning(f"Error in grokking_signal_fn: {e}. Ignoring this function.")

        if not signals:
            return 0.0

        return sum(signals) / len(signals)

def format_capybara_prompts(examples):
    texts = []
    for conversation in examples['conversation']:
        formatted_text = ""
        for turn in conversation:
            if 'input' in turn:
                formatted_text += f"Human: {turn['input']}\n\n"
            if 'output' in turn:
                formatted_text += f"Assistant: {turn['output']}\n\n"
        texts.append(formatted_text.strip())
    return {"text": texts}

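# CustomTrainer wires the training loss back into the optimizer: each training_step stores
# loss.item() in self.grokking_signal, which grokking_signal_fn() below hands to GrokAdamW.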
class CustomTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.grokking_signal = 0.0

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
        return (loss, outputs) if return_outputs else loss

    def training_step(self, model, inputs):
        model.train()
        inputs = self._prepare_inputs(inputs)

        with autocast(dtype=torch.bfloat16):
            loss = self.compute_loss(model, inputs)

        if self.args.gradient_accumulation_steps > 1:
            loss = loss / self.args.gradient_accumulation_steps

        loss.backward()

        self.grokking_signal = loss.item()

        return loss.detach()

def grokking_signal_fn():
    return trainer.grokking_signal

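# main(): load the base model and tokenizer, format and tokenize the Capybara dataset,
# then fine-tune with the Trainer using the GrokAdamW optimizer and save the result.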
def main():
    logger.info(f"🚀 Initializing {MODEL_NAME} finetuning with GrokAdamW")

    try:
        config = AutoConfig.from_pretrained(MODEL_NAME)
        tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
    except Exception as e:
        logger.error(f"❌ Failed to load model or tokenizer: {str(e)}")
        return

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = model.config.eos_token_id

    logger.info("📚 Loading Capybara dataset")
    try:
        capybara_dataset = load_dataset("LDJnr/Capybara", split="train")
        capybara_dataset = capybara_dataset.map(format_capybara_prompts, batched=True, remove_columns=capybara_dataset.column_names)
    except Exception as e:
        logger.error(f"❌ Failed to load Capybara dataset: {str(e)}")
        return

    logger.info(f"📊 Capybara dataset size: {len(capybara_dataset)}")

    def tokenize_function(examples):
        return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=MAX_LENGTH)

    logger.info("🔢 Tokenizing dataset")
    tokenized_dataset = capybara_dataset.map(tokenize_function, batched=True, remove_columns=capybara_dataset.column_names)

    logger.info("🏗️ Setting up the training arguments")
    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        num_train_epochs=3,
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        learning_rate=LEARNING_RATE,
        weight_decay=0.01,
        bf16=True,
        logging_steps=10,
        save_steps=300,
        save_total_limit=10,
        dataloader_num_workers=4,
        warmup_steps=NUM_WARMUP_STEPS,
        gradient_checkpointing=True,
        evaluation_strategy="steps",
        eval_steps=300,
        max_steps=MAX_STEPS,
        fp16=False,
        optim="adamw_hf",
        lr_scheduler_type="cosine",
        load_best_model_at_end=True,
        metric_for_best_model="loss",
        greater_is_better=False,
    )

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    optimizer = GrokAdamW(
        model.parameters(),
        lr=LEARNING_RATE,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0.01,
        alpha_init=0.98,
        lamb=2.0,
        gamma=0.1,
        grokking_signal_fns=[grokking_signal_fn],
        grokking_signal_decay_rate=0.1,
        gradient_clipping=1.0
    )

    logger.info("🏃‍♂️ Initializing Trainer with GrokAdamW")
    global trainer
    trainer = CustomTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        eval_dataset=tokenized_dataset.select(range(min(1000, len(tokenized_dataset)))),
        data_collator=data_collator,
        optimizers=(optimizer, None),
    )

    logger.info("🔥 Starting the training with GrokAdamW")
    try:
        trainer.train()
    except Exception as e:
        logger.error(f"❌ Training failed: {str(e)}")
        return

    logger.info("💾 Saving the model")
    try:
        trainer.save_model(OUTPUT_DIR)
    except Exception as e:
        logger.error(f"❌ Failed to save model: {str(e)}")

    logger.info("🎉 Finetuning with GrokAdamW completed!")

if __name__ == "__main__":
    main()
```
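Once training finishes, a quick sanity check of the saved checkpoint might look like this (a minimal sketch; the paths follow `OUTPUT_DIR` and `MODEL_NAME` from the script above, and the prompt format mirrors `format_capybara_prompts`):

```python
# Sketch: load the finetuned checkpoint from OUTPUT_DIR and generate a reply
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nisten/Biggie-SmoLlm-0.15B-Base")  # tokenizer isn't written by save_model here
model = AutoModelForCausalLM.from_pretrained("./capybara_finetuned_results")

prompt = "Human: I want to bring my cat to Mars.\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    top_p=0.85,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
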
Now go forth and train, accelerate that code (you'll need about 22 GB of VRAM; change to batch size 1 for 8 GB) 🧪🚀
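The batch-size tweak mentioned above just means editing the constants at the top of `meow.py`, for example (assumed values, not benchmarked):

```python
# Low-VRAM settings for roughly 8 GB cards (an assumption, not a benchmark)
BATCH_SIZE = 1                    # down from 12
GRADIENT_ACCUMULATION_STEPS = 24  # 1 x 24 keeps the effective batch size at 24 (was 12 x 2)
```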