TheBloke committed on
Commit 45f33c6
1 Parent(s): 5b0da5b

Update README.md

Files changed (1)
  1. README.md +14 -8
README.md CHANGED
@@ -21,19 +21,25 @@ inference: false
  </div>
  <!-- header end -->

- # Falcon-40B-Instruct 3bit GPTQ

- This repo contains an experimental GPTQ 3bit model for [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).

  It is the result of quantising to 3bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).

  ## EXPERIMENTAL

  Please note this is an experimental GPTQ model. Support for it is currently quite limited.

  It is also expected to be **VERY SLOW**. This is currently unavoidable, but is being looked at.

- This is a 3bit model, with the aim of being loadable within 24GB of VRAM. In my testing so far it has not exceeded 24GB VRAM for responses of up to 512 tokens; it may exceed 24GB beyond that.

  Please be aware that you should currently expect around 0.7 tokens/s on 40B Falcon GPTQ.

@@ -65,11 +71,11 @@ So please first update text-generation-webui to the latest version.
  1. Launch text-generation-webui with the following command-line arguments: `--autogptq --trust-remote-code`
  2. Click the **Model tab**.
- 3. Under **Download custom model or LoRA**, enter `TheBloke/falcon-40B-instruct-3bit-GPTQ`.
  4. Click **Download**.
  5. Wait until it says it's finished downloading.
  6. Click the **Refresh** icon next to **Model** in the top left.
- 7. In the **Model drop-down**: choose the model you just downloaded, `falcon-40B-instruct-3bit-GPTQ`.
  8. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!

  ## About `trust_remote_code`
@@ -95,7 +101,7 @@ from transformers import AutoTokenizer
  from auto_gptq import AutoGPTQForCausalLM

  # Download the model from HF and store it locally, then reference its location here:
- quantized_model_dir = "/path/to/falcon40b-instruct-3bit-gptq"

  from transformers import AutoTokenizer
  tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)
@@ -112,13 +118,13 @@ print(tokenizer.decode(output[0]))
  ## Provided files

- **gptq_model-3bit--1g.safetensors**

  This will work with AutoGPTQ as of commit `3cb1bf5` (`3cb1bf5a6d43a06dc34c6442287965d1838303d3`)

  It was created without groupsize to reduce VRAM requirements, and with `desc_act` (act-order) to improve inference quality.

- * `gptq_model-3bit--1g.safetensors`
  * Works only with latest AutoGPTQ CUDA, compiled from source as of commit `3cb1bf5`
  * At this time it does not work with AutoGPTQ Triton, but support will hopefully be added in time.
  * Works with text-generation-webui using `--autogptq --trust_remote_code`
 
  </div>
  <!-- header end -->

+ # Falcon-40B-Instruct 4bit GPTQ

+ This repo contains an experimental GPTQ 4bit model for [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).

  It is the result of quantising to 4bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).

+ ## Repositories available
+
+ * [4-bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ)
+ * [3-bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-3bit-GPTQ)
+ * [Unquantised bf16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/tiiuae/falcon-40b-instruct)
+
  ## EXPERIMENTAL

  Please note this is an experimental GPTQ model. Support for it is currently quite limited.

  It is also expected to be **VERY SLOW**. This is currently unavoidable, but is being looked at.

+ This 4bit model requires at least 35GB of VRAM to load. It can be used on 40GB or 48GB cards, but not on smaller ones.

  Please be aware that you should currently expect around 0.7 tokens/s on 40B Falcon GPTQ.
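
To put those two figures in concrete terms, here is a minimal sketch (not part of the README) that compares the GPU's reported memory with the ~35GB requirement and turns the ~0.7 tokens/s rate into an expected response time; both thresholds are simply the numbers quoted in the preceding paragraphs.

```python
# Rough pre-flight check -- a sketch, not from the original README.
import torch

REQUIRED_GIB = 35        # approximate VRAM needed to load the 4bit model (see above)
TOKENS_PER_SECOND = 0.7  # approximate generation speed quoted above

total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU 0 reports {total_gib:.0f} GiB of VRAM")
if total_gib < REQUIRED_GIB:
    print(f"Warning: around {REQUIRED_GIB} GiB is needed just to load the weights")

# At ~0.7 tokens/s, a 512-token reply takes roughly 12 minutes.
print(f"512 tokens would take about {512 / TOKENS_PER_SECOND / 60:.0f} minutes")
```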
 
  1. Launch text-generation-webui with the following command-line arguments: `--autogptq --trust-remote-code`
  2. Click the **Model tab**.
+ 3. Under **Download custom model or LoRA**, enter `TheBloke/falcon-40B-instruct-GPTQ`.
  4. Click **Download**.
  5. Wait until it says it's finished downloading.
  6. Click the **Refresh** icon next to **Model** in the top left.
+ 7. In the **Model drop-down**: choose the model you just downloaded, `falcon-40B-instruct-GPTQ`.
  8. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!
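
If you prefer to fetch the weights from a script or the command line rather than through the web UI box in step 3, here is a minimal sketch using `huggingface_hub`; this is an alternative route, not part of the README's instructions.

```python
# Sketch only: download the repo referenced in step 3 into the local HF cache.
# Point text-generation-webui (or the Python example further down) at the returned path.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="TheBloke/falcon-40b-instruct-GPTQ")
print(local_path)
```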
  ## About `trust_remote_code`
 
  from auto_gptq import AutoGPTQForCausalLM

  # Download the model from HF and store it locally, then reference its location here:
+ quantized_model_dir = "/path/to/falcon40b-instruct-GPTQ"

  from transformers import AutoTokenizer
  tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)
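
The diff only shows the lines around the changed path; the rest of the README's Python example is not reproduced here. As a sketch of how the remainder of an AutoGPTQ load-and-generate flow typically continues from the snippet above (argument values are assumptions, not necessarily the README's exact ones):

```python
# Sketch continuing the snippet above (quantized_model_dir and tokenizer are
# defined there). Argument values are illustrative assumptions.
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    use_safetensors=True,    # the provided file is a .safetensors
    device="cuda:0",
    use_triton=False,        # AutoGPTQ Triton does not support this model yet
    trust_remote_code=True,  # Falcon needs its custom modelling code
)

prompt = "Write a story about llamas"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0]))
```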
 
  ## Provided files

+ **gptq_model-4bit--1g.safetensors**

  This will work with AutoGPTQ as of commit `3cb1bf5` (`3cb1bf5a6d43a06dc34c6442287965d1838303d3`)

  It was created without groupsize to reduce VRAM requirements, and with `desc_act` (act-order) to improve inference quality.

+ * `gptq_model-4bit--1g.safetensors`
  * Works only with latest AutoGPTQ CUDA, compiled from source as of commit `3cb1bf5`
  * At this time it does not work with AutoGPTQ Triton, but support will hopefully be added in time.
  * Works with text-generation-webui using `--autogptq --trust_remote_code`
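
For reference, the settings described above (4bit, no groupsize, `desc_act` enabled) correspond to an AutoGPTQ quantisation config along the following lines; this is a sketch of the likely configuration, not a copy of the exact one used to produce the file.

```python
# Sketch (assumption): an AutoGPTQ config matching the description above.
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4bit, as in gptq_model-4bit--1g.safetensors
    group_size=-1,   # "without groupsize" -- the "-1g" in the filename
    desc_act=True,   # act-order, to improve inference quality
)
```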