readme: add detailed instructions
README.md
CHANGED
@@ -23,11 +23,57 @@ Using llama.cpp fork: [https://github.com/fairydreaming/llama.cpp/tree/deepseek-

# How to use:

**Downloading the bf16:**

- Find the relevant directory
- Download all files
- Run merge.py (see the example after this list)
- The merged GGUF should appear
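
For example, using huggingface-cli (the repo id and directory name below are placeholders, not taken from this README):

```
# download every file in the bf16 directory of the repo (names are assumptions)
huggingface-cli download {repo_id} --include "bf16/*" --local-dir .
# then run the merge script; check the script itself for its exact arguments
python merge.py
```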

**Downloading the quantizations:**

- Find the relevant directory
- Download all files
- Point to the first split (most programs should load all the splits automatically now; see the example below)
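
For example, pointing llama.cpp at the first split (the quant and split file name below are assumptions based on the usual gguf-split naming scheme):

```
# loading the first split is enough; the remaining splits are found automatically
main -m DeepSeek-V2-Chat.Q4_K_M-00001-of-00005.gguf -c 4096 --color -i
```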

**Running in llama.cpp:**

To start in command-line interactive mode (text completion):
```
main -m DeepSeek-V2-Chat.{quant}.gguf -c {context_length} --color -i
```
To use the llama.cpp OpenAI-compatible server:
```
server \
  -m DeepSeek-V2-Chat.{quant}.gguf \
  -c {context_length} \
  (--color [recommended: colored output in supported terminals]) \
  (-i [note: interactive mode]) \
  (--mlock [note: avoid using swap]) \
  (--verbose) \
  (--log-disable [note: disable logging to file, may be useful for prod]) \
  (--metrics [note: Prometheus-compatible monitoring endpoint]) \
  (--api-key [string]) \
  (--port [int]) \
  (--flash-attn [note: must be fully offloaded to a supported GPU])
```
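
Once the server is running, it accepts OpenAI-style requests. A minimal curl example (the port shown is llama.cpp's default, 8080, and the prompt is a placeholder):

```
# add -H "Authorization: Bearer {api key}" if the server was started with --api-key
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```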
Making an importance matrix:
```
imatrix \
  -m DeepSeek-V2-Chat.{quant}.gguf \
  -f groups_merged.txt \
  --verbosity [0, 1, 2] \
  -ngl {GPU offloading; requires a CUDA build} \
  --ofreq {recommended: 1}
```
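
Filled in, a run might look like this (the bf16 model stands in for {quant}, and the -ngl value is a placeholder):

```
# computed on the bf16 model; -o sets the output file (imatrix.dat is also the default)
imatrix \
  -m DeepSeek-V2-Chat.bf16.gguf \
  -f groups_merged.txt \
  --verbosity 1 \
  -ngl 99 \
  --ofreq 1 \
  -o imatrix.dat
```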
Making a quant:
```
quantize \
  (--imatrix [file]) \
  DeepSeek-V2-Chat.bf16.gguf \
  DeepSeek-V2-Chat.{quant}.gguf \
  {quant}
```
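
A concrete example producing a Q4_K_M quant with the imatrix from the previous step (file names are assumptions):

```
# --imatrix is optional but improves low-bit quants
quantize \
  --imatrix imatrix.dat \
  DeepSeek-V2-Chat.bf16.gguf \
  DeepSeek-V2-Chat.Q4_K_M.gguf \
  Q4_K_M
```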

# Quants:
```
- bf16 [size: 439 GB]