Update README.md

This repo contains GPTQ model files for Technology Innovation Institute's Falcon 180B.
Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
## Requirements

Transformers version 4.33.0 is required.

Due to the huge size of the model, the GPTQ model files have been sharded. This breaks compatibility with AutoGPTQ, and therefore with any clients or libraries that use AutoGPTQ directly.

But the files work great when loaded directly through Transformers, and they can be served using Text Generation Inference!

## Compatibility

Currently these GPTQs are known to work with:

- Transformers 4.33.0
- [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) version 1.0.4
- Docker container: `ghcr.io/huggingface/text-generation-inference:latest`
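
One way to use the Docker image listed above is to launch it with this repo as the model. The sketch below is illustrative only and is not part of the original README: the host cache directory, published port, and `--num-shard` value are assumptions you will need to adapt to your own hardware.

```python
# Hedged sketch: start the TGI Docker container named above and point it at this repo.
# The volume path, port mapping and shard count below are assumptions.
import subprocess

subprocess.run(
    [
        "docker", "run", "--gpus", "all", "--shm-size", "1g",
        "-p", "8080:80",                # serve TGI on localhost:8080 (assumed)
        "-v", "/data/tgi-cache:/data",  # assumed host directory for the model cache
        "ghcr.io/huggingface/text-generation-inference:latest",
        "--model-id", "TheBloke/Falcon-180B-GPTQ",
        "--quantize", "gptq",           # load the GPTQ weights
        "--num-shard", "8",             # assumed number of GPUs
    ],
    check=True,
)
```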
<!-- description end -->
<!-- repositories-available start -->
## Repositories available
## Provided files

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| main | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 90.00 GB | No | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
| gptq-3bit-128g-actorder_True | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | Processing, coming soon | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False. |
| gptq-3bit--1g-actorder_True | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | Processing, coming soon | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
| gptq-4bit--1g-actorder_True | 4 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | Processing, coming soon | No | 4-bit, with Act Order. No group size, to lower VRAM requirements. |
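
The branch names in the table map directly to git revisions on the Hugging Face Hub. As a hedged illustration (not part of the original README), a specific branch can be fetched programmatically with `huggingface_hub.snapshot_download`; the `local_dir` value here is an assumption.

```python
# Hedged sketch: download a single branch of the repo with huggingface_hub.
# The revision comes from the Branch column above; local_dir is an assumed path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/Falcon-180B-GPTQ",
    revision="gptq-3bit--1g-actorder_True",  # or "main" for the 4-bit, 128g files
    local_dir="Falcon-180B-GPTQ",            # assumed local target directory
)
```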
<!-- README_GPTQ.md-text-generation-webui start -->
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)

**NOTE**: I have not tested this model with text-generation-webui. It *should* work through the Transformers loader. It will *not* work through the AutoGPTQ loader, due to the files being sharded.

Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

It is strongly recommended to use the text-generation-webui one-click installers unless you're sure you know how to make a manual install.

1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/Falcon-180B-GPTQ`.
   - To download from a specific branch, enter for example `TheBloke/Falcon-180B-GPTQ:gptq-3bit--1g-actorder_True`
   - See Provided Files above for the list of branches for each option.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done".
5. Choose **Transformers** as the Loader.
6. In the top left, click the refresh icon next to **Model**.
7. In the **Model** dropdown, choose the model you just downloaded: `Falcon-180B-GPTQ`.
8. The model will automatically load, and is now ready for use!
9. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
   - Note that you do not need to, and should not, set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
10. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
<!-- README_GPTQ.md-text-generation-webui end -->

<!-- README_GPTQ.md-use-from-python start -->
### Install the necessary packages

Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ.

```shell
pip3 install "transformers>=4.33.0" "optimum>=1.12.0"
pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7
```
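
As a quick sanity check (a sketch added here, not part of the original README), you can confirm that the installed versions satisfy the requirements above:

```python
# Hedged sketch: print the installed versions of the required packages
# (Transformers >= 4.33.0, Optimum >= 1.12.0, plus AutoGPTQ).
from importlib.metadata import version

for pkg in ("transformers", "optimum", "auto-gptq"):
    print(pkg, version(pkg))
```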
### Transformers sample code
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/Falcon-180B-GPTQ"

# To use a different branch, change revision
# For example: revision="gptq-3bit--1g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",  # spread the sharded weights across available GPUs
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template = f'''User: {prompt}
Assistant: '''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
```
<!-- README_GPTQ.md-use-from-python end -->
<!-- README_GPTQ.md-compatibility start -->
## Compatibility

The provided files have been tested with Transformers 4.33.0 and TGI 1.0.4.

Because they are sharded, they will not yet load via AutoGPTQ. It is hoped that AutoGPTQ support will be added soon.

Note: lack of support for AutoGPTQ doesn't affect your ability to load these models from Python code. It only affects third-party clients that might use AutoGPTQ.

[Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is confirmed working as of version 1.0.4.
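
For a quick check that a TGI deployment is serving the model, here is a minimal sketch using `huggingface_hub.InferenceClient` (not from the original README; the endpoint URL assumes TGI is already running locally, for example via the Docker container mentioned above):

```python
# Hedged sketch: query a TGI server that is already serving Falcon-180B-GPTQ.
# The URL is an assumption; point it at wherever your TGI instance is listening.
from huggingface_hub import InferenceClient

client = InferenceClient("http://127.0.0.1:8080")
print(client.text_generation("User: Tell me about AI\nAssistant: ",
                             max_new_tokens=128,
                             temperature=0.7))
```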
<!-- README_GPTQ.md-compatibility end -->

<!-- footer start -->