TheBloke committed on
Commit 78dfd93
1 Parent(s): fa65c27

Update README.md

Files changed (1)
  1. README.md +40 -39
README.md CHANGED
@@ -43,6 +43,21 @@ This repo contains GPTQ model files for [Technology Innovation Institute's Falco
 
 Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
 
+ ## Requirements
+
+ Transformers version 4.33.0 is required.
+
+ Due to the huge size of the model, the GPTQ has been sharded. This will break compatibility with AutoGPTQ, and therefore any clients/libraries that use AutoGPTQ directly.
+
+ But they work great loaded directly through Transformers - and can be served using Text Generation Inference!
+
+ ## Compatibility
+
+ Currently these GPTQs are known to work with:
+ - Transformers 4.33.0
+ - [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) version 1.0.4
+ - Docker container: `ghcr.io/huggingface/text-generation-inference:latest`
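+
+ As an illustrative sketch only (these exact parameters have not been tested here - adjust GPU count, shard count, port and cache volume to your hardware), the container above can be launched and queried like this:
+
+ ```shell
+ # Launch TGI 1.0.4 serving this repo's GPTQ files (example parameters only)
+ docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
+     ghcr.io/huggingface/text-generation-inference:1.0.4 \
+     --model-id TheBloke/Falcon-180B-GPTQ --quantize gptq --num-shard 8
+
+ # Then, from another shell, send a test request
+ curl 127.0.0.1:8080/generate -X POST -H 'Content-Type: application/json' \
+     -d '{"inputs": "User: Tell me about AI\nAssistant: ", "parameters": {"max_new_tokens": 256}}'
+ ```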
+
 <!-- description end -->
 <!-- repositories-available start -->
 ## Repositories available
@@ -86,7 +101,7 @@ All recent GPTQ files are made with AutoGPTQ, and all files in non-main branches
 
 | Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
 | ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
- | main | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 10.00 GB | No | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
+ | main | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 90.00 GB | No | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
 | gptq-3bit-128g-actorder_True | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | Processing, coming soon | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False. |
 | gptq-3bit--1g-actorder_True | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | Processing, coming soon | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
 | gptq-4bit--1g-actorder_True | 4 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | Processing, coming soon | No | 4-bit, with Act Order. No group size, to lower VRAM requirements. |
@@ -106,22 +121,25 @@ git clone --single-branch --branch main https://huggingface.co/TheBloke/Falcon-1
 <!-- README_GPTQ.md-text-generation-webui start -->
 ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
+ **NOTE**: I have not tested this model with Text Generation Webui. It *should* work through the Transformers Loader. It will *not* work through the AutoGPTQ loader, due to the files being sharded.
+
 Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
 It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
 
 1. Click the **Model tab**.
 2. Under **Download custom model or LoRA**, enter `TheBloke/Falcon-180B-GPTQ`.
- - To download from a specific branch, enter for example `TheBloke/Falcon-180B-GPTQ:main`
+ - To download from a specific branch, enter for example `TheBloke/Falcon-180B-GPTQ:gptq-3bit--1g-actorder_True`
   - see Provided Files above for the list of branches for each option.
 3. Click **Download**.
 4. The model will start downloading. Once it's finished it will say "Done".
- 5. In the top left, click the refresh icon next to **Model**.
- 6. In the **Model** dropdown, choose the model you just downloaded: `Falcon-180B-GPTQ`
- 7. The model will automatically load, and is now ready for use!
- 8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
+ 5. Choose Loader: Transformers
+ 6. In the top left, click the refresh icon next to **Model**.
+ 7. In the **Model** dropdown, choose the model you just downloaded: `Falcon-180B-GPTQ`
+ 8. The model will automatically load, and is now ready for use!
+ 9. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
   * Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
- 9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
+ 10. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
 <!-- README_GPTQ.md-text-generation-webui end -->
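+
+ If you'd rather fetch a specific branch from the command line instead of through the UI, a sketch like the following should work (it assumes a recent `huggingface_hub`; the branch name and target folder are only examples - check the Provided Files table for branches that are actually live):
+
+ ```shell
+ # Download one branch of the repo into a local folder
+ huggingface-cli download TheBloke/Falcon-180B-GPTQ --revision gptq-3bit--1g-actorder_True --local-dir Falcon-180B-GPTQ
+ ```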
 
 <!-- README_GPTQ.md-use-from-python start -->
@@ -129,54 +147,36 @@ It is strongly recommended to use the text-generation-webui one-click-installers
 
 ### Install the necessary packages
 
- Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
+ Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ.
 
 ```shell
- pip3 install transformers>=4.32.0 optimum>=1.12.0
+ pip3 install transformers>=4.33.0 optimum>=1.12.0
 pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ # Use cu117 if on CUDA 11.7
 ```
 
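+ As a quick sanity check that the installed versions meet the requirements above (output format varies with your pip version):
+
+ ```shell
+ # Show the name and version of each required package
+ pip3 show transformers optimum auto-gptq | grep -E '^(Name|Version)'
+ ```
+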
- If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:
-
- ```shell
- pip3 uninstall -y auto-gptq
- git clone https://github.com/PanQiWei/AutoGPTQ
- cd AutoGPTQ
- pip3 install .
- ```
-
- ### For CodeLlama models only: you must use Transformers 4.33.0 or later.
-
- If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:
- ```shell
- pip3 uninstall -y transformers
- pip3 install git+https://github.com/huggingface/transformers.git
- ```
-
- ### You can then use the following code
+ ### Transformers sample code
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
 
 model_name_or_path = "TheBloke/Falcon-180B-GPTQ"
+
 # To use a different branch, change revision
- # For example: revision="main"
+ # For example: revision="gptq-3bit--1g-actorder_True"
 model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                              device_map="auto",
-                                              trust_remote_code=False,
                                              revision="main")
 
 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
 
 prompt = "Tell me about AI"
- prompt_template=f'''{prompt}
-
- '''
+ prompt_template=f'''User: {prompt}
+ Assistant: '''
 
 print("\n\n*** Generate:")
 
 input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
- output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
+ output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
 print(tokenizer.decode(output[0]))
 
 # Inference can also be done using transformers' pipeline
@@ -187,11 +187,10 @@ pipe = pipeline(
     model=model,
     tokenizer=tokenizer,
     max_new_tokens=512,
-     do_sample=True,
     temperature=0.7,
+     do_sample=True,
     top_p=0.95,
-     top_k=40,
-     repetition_penalty=1.1
+     repetition_penalty=1.15
 )
 
 print(pipe(prompt_template)[0]['generated_text'])
@@ -201,11 +200,13 @@ print(pipe(prompt_template)[0]['generated_text'])
 <!-- README_GPTQ.md-compatibility start -->
 ## Compatibility
 
- The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).
+ The provided files have been tested with Transformers 4.33.0 and TGI 1.0.4.
+
+ Because they are sharded, they will not yet load via AutoGPTQ. It is hoped support will be added soon.
 
- [ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
+ Note: lack of support for AutoGPTQ doesn't affect your ability to load these models from Python code. It only affects third-party clients that might use AutoGPTQ.
 
- [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
+ [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is confirmed working as of version 1.0.4.
 <!-- README_GPTQ.md-compatibility end -->
 
 <!-- footer start -->