dranger003 committed 396f0bf (parent: a6b8c22): Update README.md

README.md (changed)
GGUF version of [c4ai-command-r7b-12-2024](https://huggingface.co/CohereForAI/c4ai-command-r7b-12-2024)

```
./build/bin/llama-cli -fa --no-display-prompt -c 0 -m ggml-c4ai-command-r-7b-12-2024-q4_k.gguf -p "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>You are a helpful assistant.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Tell me all about yourself.<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>"
```
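
The `-p` string above is the raw Command R7B turn format: a system turn, a user turn, and an opened chatbot turn ending in `<|START_RESPONSE|>` so the model continues with its reply. A minimal sketch of assembling that prompt from plain message strings (the `build_prompt` helper is illustrative, not part of llama.cpp):

```python
def build_prompt(system: str, user: str) -> str:
    # Assemble the Command R7B turn format used in the -p argument above.
    # The special-token strings are copied from that command; this helper is illustrative only.
    return (
        f"<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{system}<|END_OF_TURN_TOKEN|>"
        f"<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{user}<|END_OF_TURN_TOKEN|>"
        f"<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|><|START_RESPONSE|>"
    )

if __name__ == "__main__":
    print(build_prompt("You are a helpful assistant.", "Tell me all about yourself."))
```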

https://github.com/ggerganov/llama.cpp/issues/10816#issuecomment-2548574766

```
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_ctx_per_seq = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.98 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1328.31 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 841
llama_new_context_with_model: graph splits = 324 (with bs=512), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 2760461191
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 8192
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1

I am Command, a sophisticated large language model built by the company Cohere. I assist users by providing thorough responses to a wide range of queries, offering information, and performing various tasks. My capabilities include answering questions, generating text, summarizing content, extracting data, and performing various other tasks based on the user's requirements.

I strive to provide accurate and helpful information while ensuring a positive and informative user experience. Feel free to ask me about any topic, and I'll do my best to assist you! [end of text]


llama_perf_sampler_print: sampling time = 15.07 ms / 128 runs ( 0.12 ms per token, 8491.44 tokens per second)
llama_perf_context_print: load time = 1076.84 ms
llama_perf_context_print: prompt eval time = 181.62 ms / 22 tokens ( 8.26 ms per token, 121.13 tokens per second)
llama_perf_context_print: eval time = 4938.01 ms / 105 runs ( 47.03 ms per token, 21.26 tokens per second)
llama_perf_context_print: total time = 5163.42 ms / 127 tokens
```
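
The `KV self size = 1024.00 MiB` line is consistent with an f16 cache over the full 8192-token context. A rough check, assuming 32 layers, 8 KV heads, and a head dimension of 128 for this model (those values are not printed in the log):

```python
# Rough check of the "KV self size" figure in the log above.
n_ctx      = 8192   # from the log
bytes_f16  = 2      # K and V are cached as f16 per the log
n_layer    = 32     # assumed architecture value, not in the log
n_kv_heads = 8      # assumed
head_dim   = 128    # assumed

per_token = 2 * n_layer * n_kv_heads * head_dim * bytes_f16  # K + V across all layers
print(f"{n_ctx * per_token / 2**20:.2f} MiB")  # -> 1024.00 MiB, matching the log
```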
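
For reference, the active stages of the logged sampler chain (penalties, DRY, typical-p and XTC are left at neutral settings in this run) reduce to top-k, then top-p, then min-p, then temperature, then a draw from the remaining distribution. A minimal sketch with the logged parameters; this is an illustration, not llama.cpp's sampler implementation:

```python
import numpy as np

def sample(logits, top_k=40, top_p=0.95, min_p=0.05, temp=0.8, rng=np.random.default_rng()):
    # top-k: keep only the k highest-logit tokens
    kept = np.argsort(logits)[::-1][:top_k]
    probs = np.exp(logits[kept] - logits[kept].max())
    probs /= probs.sum()

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    kept, probs = kept[order[:cutoff]], probs[order[:cutoff]]

    # min-p: drop tokens whose probability is below min_p times the best one
    mask = probs >= min_p * probs.max()
    kept, probs = kept[mask], probs[mask]

    # temperature on the surviving distribution, then draw a token id
    probs = probs ** (1.0 / temp)
    probs /= probs.sum()
    return int(rng.choice(kept, p=probs))

logits = np.random.default_rng(0).standard_normal(256_000)  # stand-in logits
print(sample(logits))
```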