Converting to GGUF and running with llama.cpp?

#2
by Gord1i - opened

Firstly, thanks for the model - very cool initiative!

I'm trying to convert the model to a GGUF file for use with tools such as llama.cpp and Ollama. llama.cpp provides a conversion script for this, which introspects the Hugging Face repo and writes out a GGUF file.
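For reference, the conversion step I'm running is roughly the following (script name and flags are taken from a recent llama.cpp checkout, so adjust for your version; older checkouts name the script convert-hf-to-gguf.py):

$ python ./llama.cpp/convert_hf_to_gguf.py ./InkubaLM-0.4B \
    --outfile ./InkubaLM-0.4B/InkubaLM-0.4B-F32.gguf --outtype f32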

If I run it as is, and then run the model via llama.cpp's CLI, I get the following error:

llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_k.weight' has wrong shape; expected  2048,   256, got  2048,  2048,     1,     1

which seems to suggest that the attention parameters aren't internally consistent, or that the llama.cpp script is inferring the wrong values.
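To see which side is off, I compared what config.json implies for the key projection against the tensor actually stored in the checkpoint. A minimal sketch of that check (the weight file name, tensor name, and config keys are assumptions based on the standard Llama layout, so they may need adjusting for this repo):

import json
from safetensors import safe_open

# What config.json implies: k_proj should have num_key_value_heads * head_dim output rows
cfg = json.load(open("./InkubaLM-0.4B/config.json"))
head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
expected_rows = cfg["num_key_value_heads"] * head_dim

# What the checkpoint actually contains (file/tensor names assumed from the usual Llama naming)
with safe_open("./InkubaLM-0.4B/model.safetensors", framework="pt") as f:
    actual_shape = f.get_slice("model.layers.0.self_attn.k_proj.weight").get_shape()

print("config implies k_proj rows:", expected_rows)
print("checkpoint k_proj shape   :", actual_shape)  # the loader error above reports 2048 x 2048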

If I modify the num_key_value_heads parameter in this repo's config.json from 32 to 4, llama.cpp will load and run the model. However, it doesn't seem to return anything very sensible:

$ ./llama.cpp/llama-cli -m ./InkubaLM-0.4B/InkubaLM-0.4B-F32.gguf -p "Today i planned to " --temp 1 -n 10
Log start
main: build = 3568 (a21c6fd4)
main: built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
main: seed  = 1723400004
llama_model_loader: loaded meta data with 33 key-value pairs and 75 tensors from ./InkubaLM-0.4B/InkubaLM-0.4B-F32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Vulavula_Config
llama_model_loader: - kv   3:                           general.basename str              = InkubaLM
llama_model_loader: - kv   4:                         general.size_label str              = 0.4B
llama_model_loader: - kv   5:                            general.license str              = cc-by-nc-4.0
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["nlp", "InkubaLM", "africanLLM", "af...
llama_model_loader: - kv   7:                          general.languages arr[str,6]       = ["en", "sw", "zu", "xh", "ha", "yo"]
llama_model_loader: - kv   8:                           general.datasets arr[str,1]       = ["lelapa/Inkuba-Mono"]
llama_model_loader: - kv   9:                          llama.block_count u32              = 8
llama_model_loader: - kv  10:                       llama.context_length u32              = 2048
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 4
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 0
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 61788
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 512
llama_model_loader: - kv  20:            tokenizer.ggml.add_space_prefix bool             = true
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,61788]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  24:                      tokenizer.ggml.scores arr[f32,61788]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,61788]   = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  28:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  31:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   75 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.3459 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 61788
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 8
llm_load_print_meta: n_head           = 4
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 512
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 512
llm_load_print_meta: n_embd_head_v    = 512
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 664.16 M
llm_load_print_meta: model size       = 2.47 GiB (32.00 BPW) 
llm_load_print_meta: general.name     = Vulavula_Config
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.04 MiB
llm_load_tensors:        CPU buffer size =  2533.57 MiB
..............................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   128.00 MiB
llama_new_context_with_model: KV self size  =  128.00 MiB, K (f16):   64.00 MiB, V (f16):   64.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.24 MiB
llama_new_context_with_model:        CPU compute buffer size =   124.68 MiB
llama_new_context_with_model: graph nodes  = 262
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 1.000
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 2048, n_batch = 2048, n_predict = 10, n_keep = 1


 Today i planned to GHmgro ge ge ge ge ge ge ge
llama_print_timings:        load time =     492.17 ms
llama_print_timings:      sample time =       1.40 ms /    10 runs   (    0.14 ms per token,  7168.46 tokens per second)
llama_print_timings: prompt eval time =     170.29 ms /     6 tokens (   28.38 ms per token,    35.23 tokens per second)
llama_print_timings:        eval time =    1282.36 ms /     9 runs   (  142.48 ms per token,     7.02 tokens per second)
llama_print_timings:       total time =    1456.98 ms /    15 tokens
Log end

The output is "Today i planned to GHmgro ge ge ge ge ge ge ge" (it's a little buried in the log).

Firstly, is this a fundamentally bad idea?

Secondly, assuming the answer to the first question is some version of "no", is there anything I'm obviously doing wrong?

Lelapa AI org

Hi @Gord1i,
Thank you for your questions; we have fixed some minor issues, and you can now convert the model to a GGUF file. Let us know if you encounter any issues.

Atnafu changed discussion status to closed

Thanks, can confirm I'm now getting much more sensible answers out!
