TheBloke committed on
Commit 2f30b2f
1 Parent(s): 31893ec

Update README.md

Files changed (1):
  1. README.md +42 -43
README.md CHANGED
@@ -43,6 +43,21 @@ This repo contains GPTQ model files for [Technology Innovation Institute's Falco

Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

+ ## Requirements
+
+ Transformers version 4.33.0 is required.
+
+ Due to the huge size of the model, the GPTQ files have been sharded. This breaks compatibility with AutoGPTQ, and therefore with any clients/libraries that use AutoGPTQ directly.
+
+ But they work great when loaded directly through Transformers, and can be served using Text Generation Inference!
+
+ ## Compatibility
+
+ Currently these GPTQs are known to work with:
+ - Transformers 4.33.0
+ - [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) version 1.0.4
+ - Docker container: `ghcr.io/huggingface/text-generation-inference:latest`
+
<!-- description end -->
<!-- repositories-available start -->
## Repositories available
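The Requirements and Compatibility notes added above say these sharded GPTQ files can be served with Text Generation Inference 1.0.4 via the listed Docker image. A rough sketch of such a launch follows; the port mapping, cache path, shard count and token are placeholders to adjust for your own hardware and Hugging Face account, not values from this README:

```shell
# Hypothetical TGI 1.0.4 launch for the sharded GPTQ; ports, paths and the
# shard count below are placeholders, not values taken from this README.
docker run --rm --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/tgi-data:/data \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:1.0.4 \
  --model-id TheBloke/Falcon-180B-GPTQ \
  --quantize gptq \
  --num-shard 8
```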
@@ -53,11 +68,10 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for
<!-- repositories-available end -->

<!-- prompt-template start -->
- ## Prompt template: None
+ ## Prompt template: None (base model)

```
{prompt}
-
```

<!-- prompt-template end -->
@@ -86,9 +100,9 @@ All recent GPTQ files are made with AutoGPTQ, and all files in non-main branches

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
- | main | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 10.00 GB | No | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
- | gptq-3bit-128g-actorder_True | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 9.98 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False. |
- | gptq-3bit--1g-actorder_True | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 9.93 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
+ | main | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 94.5 GB | No | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
+ | gptq-3bit-128g-actorder_True | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 73.81 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False. |
+ | gptq-3bit--1g-actorder_True | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 70.54 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
| gptq-4bit--1g-actorder_True | 4 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 92.74 GB | No | 4-bit, with Act Order. No group size, to lower VRAM requirements. |

<!-- README_GPTQ.md-provided-files end -->
@@ -106,22 +120,25 @@ git clone --single-branch --branch main https://huggingface.co/TheBloke/Falcon-1
<!-- README_GPTQ.md-text-generation-webui start -->
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

+ **NOTE**: I have not tested this model with Text Generation Webui. It *should* work through the Transformers Loader. It will *not* work through the AutoGPTQ loader, due to the files being sharded.
+
Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/Falcon-180B-GPTQ`.
-   - To download from a specific branch, enter for example `TheBloke/Falcon-180B-GPTQ:main`
+   - To download from a specific branch, enter for example `TheBloke/Falcon-180B-GPTQ:gptq-3bit-128g-actorder_True`
  - see Provided Files above for the list of branches for each option.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done".
- 5. In the top left, click the refresh icon next to **Model**.
- 6. In the **Model** dropdown, choose the model you just downloaded: `Falcon-180B-GPTQ`
- 7. The model will automatically load, and is now ready for use!
- 8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
+ 5. Choose Loader: Transformers
+ 6. In the top left, click the refresh icon next to **Model**.
+ 7. In the **Model** dropdown, choose the model you just downloaded: `Falcon-180B-GPTQ`
+ 8. The model will automatically load, and is now ready for use!
+ 9. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
  * Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
- 9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
+ 10. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
<!-- README_GPTQ.md-text-generation-webui end -->

<!-- README_GPTQ.md-use-from-python start -->
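The updated steps above fetch a specific quantisation branch using the `repo:branch` syntax. Outside the webui, a roughly equivalent download can be sketched with `huggingface-cli`; this assumes a recent `huggingface_hub` release that ships the `download` command, and the local directory name is a placeholder:

```shell
# Hypothetical command-line download of one quantisation branch;
# the local directory is a placeholder, not a path from this README.
pip3 install --upgrade huggingface_hub
huggingface-cli download TheBloke/Falcon-180B-GPTQ \
  --revision gptq-3bit-128g-actorder_True \
  --local-dir Falcon-180B-GPTQ \
  --local-dir-use-symlinks False
```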
@@ -129,54 +146,35 @@ It is strongly recommended to use the text-generation-webui one-click-installers

### Install the necessary packages

- Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
+ Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ.

```shell
- pip3 install transformers>=4.32.0 optimum>=1.12.0
+ pip3 install transformers>=4.33.0 optimum>=1.12.0
pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ # Use cu117 if on CUDA 11.7
```

- If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:
-
- ```shell
- pip3 uninstall -y auto-gptq
- git clone https://github.com/PanQiWei/AutoGPTQ
- cd AutoGPTQ
- pip3 install .
- ```
-
- ### For CodeLlama models only: you must use Transformers 4.33.0 or later.
-
- If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:
- ```shell
- pip3 uninstall -y transformers
- pip3 install git+https://github.com/huggingface/transformers.git
- ```
-
- ### You can then use the following code
+ ### Transformers sample code

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/Falcon-180B-GPTQ"
+
# To use a different branch, change revision
- # For example: revision="main"
+ # For example: revision="gptq-3bit-128g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
-                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
- prompt_template=f'''{prompt}
-
- '''
+ prompt_template=f'''{prompt}'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
- output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
+ output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline
@@ -187,11 +185,10 @@ pipe = pipeline(
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
-     do_sample=True,
    temperature=0.7,
+     do_sample=True,
    top_p=0.95,
-     top_k=40,
-     repetition_penalty=1.1
+     repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
@@ -201,11 +198,13 @@ print(pipe(prompt_template)[0]['generated_text'])
<!-- README_GPTQ.md-compatibility start -->
## Compatibility

- The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).
+ The provided files have been tested with Transformers 4.33.0 and TGI 1.0.4.
+
+ Because they are sharded, they will not yet work via AutoGPTQ. It is hoped support will be added soon.

- [ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
+ Note: lack of support for AutoGPTQ doesn't affect your ability to load these models from Python code. It only affects third-party clients that might use AutoGPTQ.

- [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
+ [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is confirmed working as of version 1.0.4.
<!-- README_GPTQ.md-compatibility end -->

<!-- footer start -->
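The compatibility section above confirms TGI 1.0.4. Once a TGI instance is serving the model (for example as sketched after the first hunk), a quick smoke test can be run against its `/generate` endpoint; the host and port below are placeholders:

```shell
# Hypothetical request to a locally running TGI instance (host/port are placeholders).
curl http://127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Tell me about AI", "parameters": {"max_new_tokens": 64, "do_sample": true, "temperature": 0.7}}'
```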