note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
main: llamafile version 0.8.9
main: seed  = 1721530767
llama_model_loader: loaded meta data with 37 key-value pairs and 75 tensors from TinyLLama-4.6M-v0.0-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = TinyLLama
llama_model_loader: - kv   3:                             general.author str              = Maykeye
llama_model_loader: - kv   4:                            general.version str              = v0.0
llama_model_loader: - kv   5:                        general.description str              = This gguf is ported from a first vers...
llama_model_loader: - kv   6:                       general.quantized_by str              = Mofosyne
llama_model_loader: - kv   7:                         general.size_label str              = 4.6M
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                       general.license.name str              = Apache License Version 2.0, January 2004
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/datasets/choos...
llama_model_loader: - kv  11:                                general.url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  12:                           general.repo_url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  13:                         general.source.url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  14:                    general.source.repo_url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  15:                               general.tags arr[str,5]       = ["text generation", "transformer", "l...
llama_model_loader: - kv  16:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  17:                           general.datasets arr[str,2]       = ["https://huggingface.co/datasets/ron...
llama_model_loader: - kv  18:                          llama.block_count u32              = 8
llama_model_loader: - kv  19:                       llama.context_length u32              = 2048
llama_model_loader: - kv  20:                     llama.embedding_length u32              = 64
llama_model_loader: - kv  21:                  llama.feed_forward_length u32              = 256
llama_model_loader: - kv  22:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv  23:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  24:                          general.file_type u32              = 1
llama_model_loader: - kv  25:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  26:                 llama.rope.dimension_count u32              = 4
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  30:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  34:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   17 tensors
llama_model_loader: - type  f16:   58 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 64
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 8
llm_load_print_meta: n_rot            = 4
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 4
llm_load_print_meta: n_embd_head_v    = 4
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 64
llm_load_print_meta: n_embd_v_gqa     = 64
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 256
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 4.62 M
llm_load_print_meta: model size       = 8.82 MiB (16.00 BPW) 
llm_load_print_meta: general.name     = TinyLLama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.04 MiB
llm_load_tensors:        CPU buffer size =     8.82 MiB
..............
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     1.00 MiB
llama_new_context_with_model: KV self size  =    1.00 MiB, K (f16):    0.50 MiB, V (f16):    0.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    62.75 MiB
llama_new_context_with_model: graph nodes  = 262
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1


 hello world the gruff man said they had a big heart. The man smiled and said he would be the best friends.
The man said no. He said he didn't want to take a nap and he was too sad.
The man was so sad he started to cry. He didn't know what the man was one of his friends.
The man saw that the man was not the children. He felt sad and said he wanted to come back. The man was mad and he was so sad.
The man felt bad for the people, and they both felt bad for the story.
The man told the kids about the end. He said he had to stay too slow and not take them. He was sad, but it was too late.
The people were very sad, but they knew it was not nice. They were never seen again. [end of text]


llama_print_timings:        load time =      10.61 ms
llama_print_timings:      sample time =       5.92 ms /   172 runs   (    0.03 ms per token, 29073.70 tokens per second)
llama_print_timings: prompt eval time =       2.03 ms /     8 tokens (    0.25 ms per token,  3942.83 tokens per second)
llama_print_timings:        eval time =     245.61 ms /   171 runs   (    1.44 ms per token,   696.21 tokens per second)
llama_print_timings:       total time =     292.61 ms /   179 tokens
Log end
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
main: llamafile version 0.8.9
main: seed  = 1721531043
llama_model_loader: loaded meta data with 37 key-value pairs and 75 tensors from TinyLLama-4.6M-v0.0-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = TinyLLama
llama_model_loader: - kv   3:                             general.author str              = Maykeye
llama_model_loader: - kv   4:                            general.version str              = v0.0
llama_model_loader: - kv   5:                        general.description str              = This gguf is ported from a first vers...
llama_model_loader: - kv   6:                       general.quantized_by str              = Mofosyne
llama_model_loader: - kv   7:                         general.size_label str              = 4.6M
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                       general.license.name str              = Apache License Version 2.0, January 2004
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/datasets/choos...
llama_model_loader: - kv  11:                                general.url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  12:                           general.repo_url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  13:                         general.source.url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  14:                    general.source.repo_url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  15:                               general.tags arr[str,5]       = ["text generation", "transformer", "l...
llama_model_loader: - kv  16:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  17:                           general.datasets arr[str,2]       = ["https://huggingface.co/datasets/ron...
llama_model_loader: - kv  18:                          llama.block_count u32              = 8
llama_model_loader: - kv  19:                       llama.context_length u32              = 2048
llama_model_loader: - kv  20:                     llama.embedding_length u32              = 64
llama_model_loader: - kv  21:                  llama.feed_forward_length u32              = 256
llama_model_loader: - kv  22:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv  23:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  24:                          general.file_type u32              = 1
llama_model_loader: - kv  25:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  26:                 llama.rope.dimension_count u32              = 4
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  30:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  34:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   17 tensors
llama_model_loader: - type  f16:   58 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 64
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 8
llm_load_print_meta: n_rot            = 4
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 4
llm_load_print_meta: n_embd_head_v    = 4
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 64
llm_load_print_meta: n_embd_v_gqa     = 64
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 256
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 4.62 M
llm_load_print_meta: model size       = 8.82 MiB (16.00 BPW) 
llm_load_print_meta: general.name     = TinyLLama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.04 MiB
llm_load_tensors:        CPU buffer size =     8.82 MiB
..............
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     1.00 MiB
llama_new_context_with_model: KV self size  =    1.00 MiB, K (f16):    0.50 MiB, V (f16):    0.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    62.75 MiB
llama_new_context_with_model: graph nodes  = 262
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1


 hello world the gruff man said he had a dream. He thought about what he had to do. He took a deep breath and ran around the woods. The man was so excited! He couldn't wait to do the whole day. 
The man looked around and found a small box. He wanted to see what was inside. He picked up the box and started to climb. But the box was too hard. He looked around for the cake. 
The man was sad and he didn't know what to do. He asked his friend to help him. The man said no. He said he couldn't get the oven. He was sad and began to cry. 
The man felt bad because he knew it was okay. He wished he had been a good friend. He was very sad and he couldn't find the cake. [end of text]


llama_print_timings:        load time =       6.74 ms
llama_print_timings:      sample time =       6.76 ms /   169 runs   (    0.04 ms per token, 25011.10 tokens per second)
llama_print_timings: prompt eval time =       3.64 ms /     8 tokens (    0.46 ms per token,  2196.60 tokens per second)
llama_print_timings:        eval time =     340.85 ms /   168 runs   (    2.03 ms per token,   492.88 tokens per second)
llama_print_timings:       total time =     392.63 ms /   176 tokens
Log end
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
main: llamafile version 0.8.9
main: seed  = 1721531082
llama_model_loader: loaded meta data with 37 key-value pairs and 75 tensors from TinyLLama-4.6M-v0.0-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = TinyLLama
llama_model_loader: - kv   3:                             general.author str              = Maykeye
llama_model_loader: - kv   4:                            general.version str              = v0.0
llama_model_loader: - kv   5:                        general.description str              = This gguf is ported from a first vers...
llama_model_loader: - kv   6:                       general.quantized_by str              = Mofosyne
llama_model_loader: - kv   7:                         general.size_label str              = 4.6M
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                       general.license.name str              = Apache License Version 2.0, January 2004
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/datasets/choos...
llama_model_loader: - kv  11:                                general.url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  12:                           general.repo_url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  13:                         general.source.url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  14:                    general.source.repo_url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  15:                               general.tags arr[str,5]       = ["text generation", "transformer", "l...
llama_model_loader: - kv  16:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  17:                           general.datasets arr[str,2]       = ["https://huggingface.co/datasets/ron...
llama_model_loader: - kv  18:                          llama.block_count u32              = 8
llama_model_loader: - kv  19:                       llama.context_length u32              = 2048
llama_model_loader: - kv  20:                     llama.embedding_length u32              = 64
llama_model_loader: - kv  21:                  llama.feed_forward_length u32              = 256
llama_model_loader: - kv  22:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv  23:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  24:                          general.file_type u32              = 1
llama_model_loader: - kv  25:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  26:                 llama.rope.dimension_count u32              = 4
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  30:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  34:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   17 tensors
llama_model_loader: - type  f16:   58 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 64
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 8
llm_load_print_meta: n_rot            = 4
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 4
llm_load_print_meta: n_embd_head_v    = 4
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 64
llm_load_print_meta: n_embd_v_gqa     = 64
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 256
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 4.62 M
llm_load_print_meta: model size       = 8.82 MiB (16.00 BPW) 
llm_load_print_meta: general.name     = TinyLLama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.04 MiB
llm_load_tensors:        CPU buffer size =     8.82 MiB
..............
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     1.00 MiB
llama_new_context_with_model: KV self size  =    1.00 MiB, K (f16):    0.50 MiB, V (f16):    0.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    62.75 MiB
llama_new_context_with_model: graph nodes  = 262
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1


 hello world the gruff man said yes and said hello to her.
The lady said to her, “I want to keep you a new friend! You are very nice and smart. I will be very proud."
The man smiled and said, “Yes, I will be more careful!”
The man looked up, but he knew she had a secret. He said, “I'm the right thing, but I have a special plan!”
The man was very surprised, but he smiled.
"That's so smart," he said. "I have to be careful to be kind to others. It's special to help me find things, but it's too late."
The man smiled and said, "No, you can't do it to be nice to you."
The man smiled and said, "You're welcome, I'm sorry for helping you. You are very nosy!"
The man was so happy. He knew that it had helped his friends, and he could be a friend. The people in the town loved to learn together. [end of text]


llama_print_timings:        load time =       8.66 ms
llama_print_timings:      sample time =       8.79 ms /   220 runs   (    0.04 ms per token, 25017.06 tokens per second)
llama_print_timings: prompt eval time =       1.59 ms /     8 tokens (    0.20 ms per token,  5015.67 tokens per second)
llama_print_timings:        eval time =     352.66 ms /   219 runs   (    1.61 ms per token,   620.99 tokens per second)
llama_print_timings:       total time =     415.12 ms /   227 tokens
Log end
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
main: llamafile version 0.8.9
main: seed  = 1721531150
llama_model_loader: loaded meta data with 37 key-value pairs and 75 tensors from TinyLLama-4.6M-v0.0-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = TinyLLama
llama_model_loader: - kv   3:                             general.author str              = Maykeye
llama_model_loader: - kv   4:                            general.version str              = v0.0
llama_model_loader: - kv   5:                        general.description str              = This gguf is ported from a first vers...
llama_model_loader: - kv   6:                       general.quantized_by str              = Mofosyne
llama_model_loader: - kv   7:                         general.size_label str              = 4.6M
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                       general.license.name str              = Apache License Version 2.0, January 2004
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/datasets/choos...
llama_model_loader: - kv  11:                                general.url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  12:                           general.repo_url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  13:                         general.source.url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  14:                    general.source.repo_url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  15:                               general.tags arr[str,5]       = ["text generation", "transformer", "l...
llama_model_loader: - kv  16:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  17:                           general.datasets arr[str,2]       = ["https://huggingface.co/datasets/ron...
llama_model_loader: - kv  18:                          llama.block_count u32              = 8
llama_model_loader: - kv  19:                       llama.context_length u32              = 2048
llama_model_loader: - kv  20:                     llama.embedding_length u32              = 64
llama_model_loader: - kv  21:                  llama.feed_forward_length u32              = 256
llama_model_loader: - kv  22:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv  23:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  24:                          general.file_type u32              = 1
llama_model_loader: - kv  25:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  26:                 llama.rope.dimension_count u32              = 4
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  30:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  34:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   17 tensors
llama_model_loader: - type  f16:   58 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 64
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 8
llm_load_print_meta: n_rot            = 4
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 4
llm_load_print_meta: n_embd_head_v    = 4
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 64
llm_load_print_meta: n_embd_v_gqa     = 64
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 256
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 4.62 M
llm_load_print_meta: model size       = 8.82 MiB (16.00 BPW) 
llm_load_print_meta: general.name     = TinyLLama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.04 MiB
llm_load_tensors:        CPU buffer size =     8.82 MiB
..............
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     1.00 MiB
llama_new_context_with_model: KV self size  =    1.00 MiB, K (f16):    0.50 MiB, V (f16):    0.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    62.75 MiB
llama_new_context_with_model: graph nodes  = 262
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1


 hello world the gruff man said, “Hi, do you want to do something?". He looked up and said, “No!"
The man looked around and saw the big, tall hill. He wanted to see where he could see it. He jumped in and was so excited!
The lady said, “It's so messy! This is important to be careful if you can't make a big smile. Thank you for being very nice here."
The man thought for a moment and then said, “I'm sorry, I would want to eat it now!" 
The man nodded. He knew he could help the girl. He said, “No, we should not try it."
The boy was sad and said, “No, I don’t need some food. I'll be my friend." 
The man thought for a moment and then said, “You can't have to leave the garden when you're going to the park! We must always share it with your friends". 
The boy smiled and said, “Of course". 
The man smiled and said, “I'm glad you found this party! I'm very excited to play with you.” 
The boy smiled. He was very proud of himself. He said, “Do you want to have some fun? I like to watch this game. I don’t want to ask me if you can be good." 
The boy smiled and said, "Yes, I will, let's go, let's go!"
So, the boy and the boy hopped and started to walk and see the stars. They laughed and laughed, but he was very proud. [end of text]


llama_print_timings:        load time =      17.26 ms
llama_print_timings:      sample time =      13.20 ms /   347 runs   (    0.04 ms per token, 26287.88 tokens per second)
llama_print_timings: prompt eval time =       2.46 ms /     8 tokens (    0.31 ms per token,  3257.33 tokens per second)
llama_print_timings:        eval time =     550.67 ms /   346 runs   (    1.59 ms per token,   628.33 tokens per second)
llama_print_timings:       total time =     649.30 ms /   354 tokens
Log end
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
main: llamafile version 0.8.9
main: seed  = 1721531259
llama_model_loader: loaded meta data with 37 key-value pairs and 75 tensors from TinyLLama-4.6M-v0.0-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = TinyLLama
llama_model_loader: - kv   3:                             general.author str              = Maykeye
llama_model_loader: - kv   4:                            general.version str              = v0.0
llama_model_loader: - kv   5:                        general.description str              = This gguf is ported from a first vers...
llama_model_loader: - kv   6:                       general.quantized_by str              = Mofosyne
llama_model_loader: - kv   7:                         general.size_label str              = 4.6M
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                       general.license.name str              = Apache License Version 2.0, January 2004
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/datasets/choos...
llama_model_loader: - kv  11:                                general.url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  12:                           general.repo_url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  13:                         general.source.url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  14:                    general.source.repo_url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  15:                               general.tags arr[str,5]       = ["text generation", "transformer", "l...
llama_model_loader: - kv  16:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  17:                           general.datasets arr[str,2]       = ["https://huggingface.co/datasets/ron...
llama_model_loader: - kv  18:                          llama.block_count u32              = 8
llama_model_loader: - kv  19:                       llama.context_length u32              = 2048
llama_model_loader: - kv  20:                     llama.embedding_length u32              = 64
llama_model_loader: - kv  21:                  llama.feed_forward_length u32              = 256
llama_model_loader: - kv  22:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv  23:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  24:                          general.file_type u32              = 1
llama_model_loader: - kv  25:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  26:                 llama.rope.dimension_count u32              = 4
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  30:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  34:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   17 tensors
llama_model_loader: - type  f16:   58 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 64
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 8
llm_load_print_meta: n_rot            = 4
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 4
llm_load_print_meta: n_embd_head_v    = 4
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 64
llm_load_print_meta: n_embd_v_gqa     = 64
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 256
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 4.62 M
llm_load_print_meta: model size       = 8.82 MiB (16.00 BPW) 
llm_load_print_meta: general.name     = TinyLLama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.04 MiB
llm_load_tensors:        CPU buffer size =     8.82 MiB
..............
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     1.00 MiB
llama_new_context_with_model: KV self size  =    1.00 MiB, K (f16):    0.50 MiB, V (f16):    0.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    62.75 MiB
llama_new_context_with_model: graph nodes  = 262
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1


 hello world the gruff man said he would help her. As he ran, he noticed something shiny in the sky. He looked around and saw a small, old lady, who was so excited! She said, “Can I try this?"
The old lady smiled and said, “Yes, but I have to keep the egg. It is so nice!”
The old man smiled. He said, “Yes, that is a good idea! I will stay in your house and give you a hug!"
The old man smiled, but then he said, “We can be very careful when you take it away". He said: “I want to be brave," The old man was so proud of his work. He said, “I need to be happy with this. Let's play together!"
The old man said, “No, I don’t want to go. We need to be careful."
The old man said, “Don't worry, I will be happy."
The old man smiled and said, "It's okay. We can try to take some more times. But be careful. Maybe you can't stop your friends."
The old man smiled and said, “Yes, you can. I'm here to help you. It's time for this problem." The old man nodded and said, “I will find it! We can are very careful with it".
The old man agreed. He gave the ugly man a big hug and said, “I know you would like it, but you don't need it."
The old man smiled and said, “You do so. I like the old man. I can be nice. He is my friend and I will help you get your way back to the party."
The old man smiled and said, “You don’s okay. You're so brave to find it, and I'm glad you have a new friend. That's a very nice idea and I'll take your things. [end of text]


llama_print_timings:        load time =       7.35 ms
llama_print_timings:      sample time =      13.73 ms /   409 runs   (    0.03 ms per token, 29780.11 tokens per second)
llama_print_timings: prompt eval time =       1.57 ms /     8 tokens (    0.20 ms per token,  5102.04 tokens per second)
llama_print_timings:        eval time =     972.37 ms /   408 runs   (    2.38 ms per token,   419.59 tokens per second)
llama_print_timings:       total time =    1089.29 ms /   416 tokens
Log end
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
main: llamafile version 0.8.9
main: seed  = 1721532670
llama_model_loader: loaded meta data with 37 key-value pairs and 75 tensors from TinyLLama-4.6M-v0.0-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = TinyLLama
llama_model_loader: - kv   3:                             general.author str              = Maykeye
llama_model_loader: - kv   4:                            general.version str              = v0.0
llama_model_loader: - kv   5:                        general.description str              = This gguf is ported from a first vers...
llama_model_loader: - kv   6:                       general.quantized_by str              = Mofosyne
llama_model_loader: - kv   7:                         general.size_label str              = 4.6M
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                       general.license.name str              = Apache License Version 2.0, January 2004
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/datasets/choos...
llama_model_loader: - kv  11:                                general.url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  12:                           general.repo_url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  13:                         general.source.url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  14:                    general.source.repo_url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  15:                               general.tags arr[str,5]       = ["text generation", "transformer", "l...
llama_model_loader: - kv  16:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  17:                           general.datasets arr[str,2]       = ["https://huggingface.co/datasets/ron...
llama_model_loader: - kv  18:                          llama.block_count u32              = 8
llama_model_loader: - kv  19:                       llama.context_length u32              = 2048
llama_model_loader: - kv  20:                     llama.embedding_length u32              = 64
llama_model_loader: - kv  21:                  llama.feed_forward_length u32              = 256
llama_model_loader: - kv  22:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv  23:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  24:                          general.file_type u32              = 1
llama_model_loader: - kv  25:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  26:                 llama.rope.dimension_count u32              = 4
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  30:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  34:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   17 tensors
llama_model_loader: - type  f16:   58 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 64
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 8
llm_load_print_meta: n_rot            = 4
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 4
llm_load_print_meta: n_embd_head_v    = 4
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 64
llm_load_print_meta: n_embd_v_gqa     = 64
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 256
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 4.62 M
llm_load_print_meta: model size       = 8.82 MiB (16.00 BPW) 
llm_load_print_meta: general.name     = TinyLLama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.04 MiB
llm_load_tensors:        CPU buffer size =     8.82 MiB
..............
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     1.00 MiB
llama_new_context_with_model: KV self size  =    1.00 MiB, K (f16):    0.50 MiB, V (f16):    0.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    62.75 MiB
llama_new_context_with_model: graph nodes  = 262
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1


 hello world the gruff man said yes. The man said he could go. The man said the man was very brave and he wanted to go. The man was very curious and he decided to go on a trip.
After the story, the man was ready to go. He waved goodbye and said, "Don't worry, I will go on a walk." The man stepped down and said, "Don't be scared, I'xy. The man was a brave boy!"
The man smiled and said, "I'm sorry, I am so hungry." The man smiled and said, "That's nice. I'm proud of you."
The man said, "You can go with me. The man can fly fast for you, and I'll come with you." The man said, "Yes, let's do it!"
The man and the man ran around the village, and the man was happy. He was very kind. He was happy to have a friend. The man said, "You are very nice!"
The man hugged the man and said, "We are welcome, little boy. We were so proud of you. We are best friends."
The man smiled and said, "You are right, little boy, it's not a good thing to do. I have the way to the party to have fun." [end of text]


llama_print_timings:        load time =       6.42 ms
llama_print_timings:      sample time =       8.73 ms /   279 runs   (    0.03 ms per token, 31969.75 tokens per second)
llama_print_timings: prompt eval time =       1.53 ms /     8 tokens (    0.19 ms per token,  5239.03 tokens per second)
llama_print_timings:        eval time =     366.91 ms /   278 runs   (    1.32 ms per token,   757.69 tokens per second)
llama_print_timings:       total time =     440.41 ms /   286 tokens
Log end