TheBloke committed on
Commit 45f33c6
1 Parent(s): 5b0da5b

Update README.md

Files changed (1)
  1. README.md +14 -8
README.md CHANGED
@@ -21,19 +21,25 @@ inference: false
  </div>
  <!-- header end -->

- # Falcon-40B-Instruct 3bit GPTQ

- This repo contains an experimental GPTQ 3bit model for [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).

  It is the result of quantising to 3bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).

  ## EXPERIMENTAL

  Please note this is an experimental GPTQ model. Support for it is currently quite limited.

  It is also expected to be **VERY SLOW**. This is currently unavoidable, but is being looked at.

- This is a 3bit model, with the aim of being loadable within 24GB of VRAM. In my testing so far it has not exceeded 24GB VRAM for responses of up to 512 tokens; it may exceed 24GB beyond that.

  Please be aware that you should currently expect around 0.7 tokens/s on 40B Falcon GPTQ.

@@ -65,11 +71,11 @@ So please first update text-generation-webui to the latest version.
  1. Launch text-generation-webui with the following command-line arguments: `--autogptq --trust-remote-code`
  2. Click the **Model tab**.
- 3. Under **Download custom model or LoRA**, enter `TheBloke/falcon-40B-instruct-3bit-GPTQ`.
  4. Click **Download**.
  5. Wait until it says it's finished downloading.
  6. Click the **Refresh** icon next to **Model** in the top left.
- 7. In the **Model drop-down**: choose the model you just downloaded, `falcon-40B-instruct-3bit-GPTQ`.
  8. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!

  ## About `trust_remote_code`
@@ -95,7 +101,7 @@ from transformers import AutoTokenizer
  from auto_gptq import AutoGPTQForCausalLM

  # Download the model from HF and store it locally, then reference its location here:
- quantized_model_dir = "/path/to/falcon40b-instruct-3bit-gptq"

  from transformers import AutoTokenizer
  tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)
@@ -112,13 +118,13 @@ print(tokenizer.decode(output[0]))
  ## Provided files

- **gptq_model-3bit--1g.safetensors**

  This will work with AutoGPTQ as of commit `3cb1bf5` (`3cb1bf5a6d43a06dc34c6442287965d1838303d3`)

  It was created without groupsize to reduce VRAM requirements, and with `desc_act` (act-order) to improve inference quality.

- * `gptq_model-3bit--1g.safetensors`
  * Works only with latest AutoGPTQ CUDA, compiled from source as of commit `3cb1bf5`
  * At this time it does not work with AutoGPTQ Triton, but support will hopefully be added in time.
  * Works with text-generation-webui using `--autogptq --trust_remote_code`
 
  </div>
  <!-- header end -->

+ # Falcon-40B-Instruct 4bit GPTQ

+ This repo contains an experimental GPTQ 4bit model for [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).

  It is the result of quantising to 4bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).

+ ## Repositories available
+
+ * [4-bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ)
+ * [3-bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/falcon-40b-instruct-3bit-GPTQ)
+ * [Unquantised bf16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/tiiuae/falcon-40b-instruct)
+
  ## EXPERIMENTAL

  Please note this is an experimental GPTQ model. Support for it is currently quite limited.

  It is also expected to be **VERY SLOW**. This is currently unavoidable, but is being looked at.

+ This 4bit model requires at least 35GB of VRAM to load. It can be used on 40GB or 48GB cards, but not on smaller ones.

  Please be aware that you should currently expect around 0.7 tokens/s on 40B Falcon GPTQ.
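
To put those two figures in concrete terms, here is a minimal sketch (not part of the README) that compares the GPU's reported memory with the ~35GB requirement and turns the ~0.7 tokens/s rate into an expected response time; both thresholds are simply the numbers quoted in the preceding paragraphs.

```python
# Rough pre-flight check -- a sketch, not from the original README.
import torch

REQUIRED_GIB = 35        # approximate VRAM needed to load the 4bit model (see above)
TOKENS_PER_SECOND = 0.7  # approximate generation speed quoted above

total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU 0 reports {total_gib:.0f} GiB of VRAM")
if total_gib < REQUIRED_GIB:
    print(f"Warning: around {REQUIRED_GIB} GiB is needed just to load the weights")

# At ~0.7 tokens/s, a 512-token reply takes roughly 12 minutes.
print(f"512 tokens would take about {512 / TOKENS_PER_SECOND / 60:.0f} minutes")
```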
 
  1. Launch text-generation-webui with the following command-line arguments: `--autogptq --trust-remote-code`
  2. Click the **Model tab**.
+ 3. Under **Download custom model or LoRA**, enter `TheBloke/falcon-40B-instruct-GPTQ`.
  4. Click **Download**.
  5. Wait until it says it's finished downloading.
  6. Click the **Refresh** icon next to **Model** in the top left.
+ 7. In the **Model drop-down**: choose the model you just downloaded, `falcon-40B-instruct-GPTQ`.
  8. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!
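
If you prefer to fetch the weights from a script or the command line rather than through the web UI box in step 3, here is a minimal sketch using `huggingface_hub`; this is an alternative route, not part of the README's instructions.

```python
# Sketch only: download the repo referenced in step 3 into the local HF cache.
# Point text-generation-webui (or the Python example further down) at the returned path.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="TheBloke/falcon-40b-instruct-GPTQ")
print(local_path)
```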
  ## About `trust_remote_code`
 
  from auto_gptq import AutoGPTQForCausalLM

  # Download the model from HF and store it locally, then reference its location here:
+ quantized_model_dir = "/path/to/falcon40b-instruct-GPTQ"

  from transformers import AutoTokenizer
  tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=False)
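
The diff only shows the lines around the changed path; the rest of the README's Python example is not reproduced here. As a sketch of how the remainder of an AutoGPTQ load-and-generate flow typically continues from the snippet above (argument values are assumptions, not necessarily the README's exact ones):

```python
# Sketch continuing the snippet above (quantized_model_dir and tokenizer are
# defined there). Argument values are illustrative assumptions.
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    use_safetensors=True,    # the provided file is a .safetensors
    device="cuda:0",
    use_triton=False,        # AutoGPTQ Triton does not support this model yet
    trust_remote_code=True,  # Falcon needs its custom modelling code
)

prompt = "Write a story about llamas"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0]))
```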
 
  ## Provided files

+ **gptq_model-4bit--1g.safetensors**

  This will work with AutoGPTQ as of commit `3cb1bf5` (`3cb1bf5a6d43a06dc34c6442287965d1838303d3`)

  It was created without groupsize to reduce VRAM requirements, and with `desc_act` (act-order) to improve inference quality.

+ * `gptq_model-4bit--1g.safetensors`
  * Works only with latest AutoGPTQ CUDA, compiled from source as of commit `3cb1bf5`
  * At this time it does not work with AutoGPTQ Triton, but support will hopefully be added in time.
  * Works with text-generation-webui using `--autogptq --trust_remote_code`
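
For reference, the settings described above (4bit, no groupsize, `desc_act` enabled) correspond to an AutoGPTQ quantisation config along the following lines; this is a sketch of the likely configuration, not a copy of the exact one used to produce the file.

```python
# Sketch (assumption): an AutoGPTQ config matching the description above.
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4bit, as in gptq_model-4bit--1g.safetensors
    group_size=-1,   # "without groupsize" -- the "-1g" in the filename
    desc_act=True,   # act-order, to improve inference quality
)
```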