TheBloke committed on
Commit
16922aa
1 Parent(s): 6392401

Update README.md

Files changed (1)
  1. README.md +5 -8
README.md CHANGED
@@ -40,7 +40,6 @@ To build cmp-nct's fork of llama.cpp with Falcon 40B support plus preliminary CU
 ```
 git clone https://github.com/cmp-nct/ggllm.cpp
 cd ggllm.cpp
-git checkout cuda-integration
 rm -rf build && mkdir build && cd build && cmake -DGGML_CUBLAS=1 .. && cmake --build . --config Release
 ```

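This hunk drops the `git checkout cuda-integration` step, so a fresh clone of the default branch now builds directly. A minimal sketch of the post-commit build, assuming `git`, `cmake`, and a CUDA toolkit are installed; the `DO_BUILD` guard is an illustrative addition, not part of the README:

```shell
#!/bin/sh
# Sketch of the post-commit build: no branch checkout is needed any more.
# DO_BUILD is a hypothetical guard so the commands can be inspected without
# actually cloning and compiling.
set -eu
REPO="https://github.com/cmp-nct/ggllm.cpp"
CMAKE_FLAGS="-DGGML_CUBLAS=1"
if [ "${DO_BUILD:-0}" = "1" ]; then
    git clone "$REPO"
    cd ggllm.cpp
    rm -rf build && mkdir build && cd build
    cmake $CMAKE_FLAGS .. && cmake --build . --config Release
fi
```

Setting `DO_BUILD=1` runs the same sequence as the README's one-liner.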
@@ -48,25 +47,23 @@ Compiling on Windows: developer cmp-nct notes: 'I personally compile it using VS
 
 Once compiled you can then use `bin/falcon_main` just like you would use llama.cpp. For example:
 ```
-bin/falcon_main -t 8 -ngl 100 -m /workspace/wizard-falcon40b.ggmlv3.q3_K_S.bin -p "What is a falcon?\n### Response:"
+bin/falcon_main -t 8 -ngl 100 -b 1 -m /workspace/wizard-falcon40b.ggmlv3.q3_K_S.bin -p "What is a falcon?\n### Response:"
 ```
 
-Using `-ngl 100` will offload all layers to GPU. If you do not have enough VRAM for this, either lower the number or try a smaller quant size as otherwise performance will be severely affected.
+You can specify `-ngl 100` regardless of your VRAM, as it will automatically detect how much VRAM is available and can be used.
 
 Adjust `-t 8` according to what performs best on your system. Do not exceed the number of physical CPU cores you have.
 
+`-b 1` reduces the batch size to 1. This slightly slows prompt evaluation, but frees up VRAM to load more of the model on to your GPU. If you find prompt evaluation too slow and have enough spare VRAM, you can remove this parameter.
+
 <!-- compatibility_ggml end -->
 
 ## Provided files
 | Name | Quant method | Bits | Size | Max RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
-| wizard-falcon40b.ggmlv3.q2_K.bin | q2_K | 2 | 13.74 GB | 16.24 GB | Uses GGML_TYPE_Q4_K for the attention.vw and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
-| wizard-falcon40b.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 17.98 GB | 20.48 GB | Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
-| wizard-falcon40b.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 17.98 GB | 20.48 GB | Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
+| wizard-falcon40b.ggmlv3.q2_K.bin | q2_K | 2 | 13.74 GB | 16.24 GB | Uses GGML_TYPE_Q2_K for all tensors. |
 | wizard-falcon40b.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 17.98 GB | 20.48 GB | Uses GGML_TYPE_Q3_K for all tensors |
-| wizard-falcon40b.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 23.54 GB | 26.04 GB | Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
 | wizard-falcon40b.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 23.54 GB | 26.04 GB | Uses GGML_TYPE_Q4_K for all tensors |
-| wizard-falcon40b.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 28.77 GB | 31.27 GB | Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
 | wizard-falcon40b.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 28.77 GB | 31.27 GB | Uses GGML_TYPE_Q5_K for all tensors |
 | wizard-falcon40b.ggmlv3.q6_K.bin | q6_K | 6 | 34.33 GB | 36.83 GB | Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
 | wizard-falcon40b.ggmlv3.q8_0.bin | q8_0 | 8 | 44.46 GB | 46.96 GB | 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
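Taken together, the flags in the updated example can be wrapped in a small launcher. A hedged sketch, where the model path, the 8-thread cap, and the `bin/falcon_main` location are assumptions carried over from the README's example rather than fixed requirements:

```shell
#!/bin/sh
# Hypothetical launcher applying the README's advice: -ngl 100 to offload as
# many layers as VRAM allows, -b 1 to trade prompt-eval speed for VRAM, and
# -t capped at the example's 8 (never exceed your physical core count).
MODEL="${MODEL:-/workspace/wizard-falcon40b.ggmlv3.q3_K_S.bin}"
THREADS="$(nproc 2>/dev/null || echo 8)"   # nproc counts logical CPUs; lower this if SMT inflates it
[ "$THREADS" -gt 8 ] && THREADS=8
CMD="bin/falcon_main -t $THREADS -ngl 100 -b 1 -m $MODEL"
printf '%s\n' "$CMD -p \"What is a falcon?\n### Response:\""
```

Exporting `MODEL` before running switches quant files without editing the script.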