TheBloke committed on
Commit
a146a04
·
1 Parent(s): ae156bd

Upload README.md

  1. README.md +42 -43
README.md CHANGED
@@ -43,21 +43,6 @@ This repo contains GPTQ model files for [Technology Innovation Institute's Falco
43
 
44
  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
45
 
46
- ## Requirements
47
-
48
- Transformers version 4.33.0 is required.
49
-
50
- Due to the huge size of the model, the GPTQ has been sharded. This will break compatibility with AutoGPTQ, and therefore any clients/libraries that use AutoGPTQ directly.
51
-
52
- But they work great loaded directly through Transformers - and can be served using Text Generation Inference!
53
-
54
- ## Compatibility
55
-
56
- Currently these GPTQs are known to work with:
57
- - Transformers 4.33.0
58
- - [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) version 1.0.4
59
- - Docker container: `ghcr.io/huggingface/text-generation-inference:latest`
60
-
61
  <!-- description end -->
62
  <!-- repositories-available start -->
63
  ## Repositories available
@@ -101,10 +86,10 @@ All recent GPTQ files are made with AutoGPTQ, and all files in non-main branches
101
 
102
  | Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
103
  | ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
104
- | main | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 90.00 GB | No | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
105
- | gptq-3bit-128g-actorder_True | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | Processing, coming soon | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False. |
106
- | gptq-3bit--1g-actorder_True | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | Processing, coming soon | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
107
- | gptq-4bit--1g-actorder_True | 4 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | Processing, coming soon | No | 4-bit, with Act Order. No group size, to lower VRAM requirements. |
108
 
109
  <!-- README_GPTQ.md-provided-files end -->
110
 
@@ -121,25 +106,22 @@ git clone --single-branch --branch main https://huggingface.co/TheBloke/Falcon-1
121
  <!-- README_GPTQ.md-text-generation-webui start -->
122
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
123
 
124
- **NOTE**: I have not tested this model with Text Generation Webui. It *should* work through the Transformers Loader. It will *not* work through the AutoGPTQ loader, due to the files being sharded.
125
-
126
  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
127
 
128
  It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
129
 
130
  1. Click the **Model tab**.
131
  2. Under **Download custom model or LoRA**, enter `TheBloke/Falcon-180B-GPTQ`.
132
- - To download from a specific branch, enter for example `TheBloke/Falcon-180B-GPTQ:gptq-3bit--1g-actorder_True`
133
  - see Provided Files above for the list of branches for each option.
134
  3. Click **Download**.
135
  4. The model will start downloading. Once it's finished it will say "Done".
136
- 5. Choose Loader: Transformers
137
- 6. In the top left, click the refresh icon next to **Model**.
138
- 7. In the **Model** dropdown, choose the model you just downloaded: `Falcon-180B-GPTQ`
139
- 8. The model will automatically load, and is now ready for use!
140
- 9. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
141
  * Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
142
- 10. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
143
  <!-- README_GPTQ.md-text-generation-webui end -->
144
 
145
  <!-- README_GPTQ.md-use-from-python start -->
@@ -147,36 +129,54 @@ It is strongly recommended to use the text-generation-webui one-click-installers
147
 
148
  ### Install the necessary packages
149
 
150
- Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ.
151
 
152
  ```shell
153
- pip3 install "transformers>=4.33.0" "optimum>=1.12.0"
154
  pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ # Use cu117 if on CUDA 11.7
155
  ```
156
 
157
- ### Transformers sample code
158
 
159
  ```python
160
  from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
161
 
162
  model_name_or_path = "TheBloke/Falcon-180B-GPTQ"
163
-
164
  # To use a different branch, change revision
165
- # For example: revision="gptq-3bit--1g-actorder_True"
166
  model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
167
  device_map="auto",
 
168
  revision="main")
169
 
170
  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
171
 
172
  prompt = "Tell me about AI"
173
- prompt_template=f'''User: {prompt}
174
- Assistant: '''
 
175
 
176
  print("\n\n*** Generate:")
177
 
178
  input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
179
- output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
180
  print(tokenizer.decode(output[0]))
181
 
182
  # Inference can also be done using transformers' pipeline
@@ -187,10 +187,11 @@ pipe = pipeline(
187
  model=model,
188
  tokenizer=tokenizer,
189
  max_new_tokens=512,
190
- temperature=0.7,
191
  do_sample=True,
 
192
  top_p=0.95,
193
- repetition_penalty=1.15
 
194
  )
195
 
196
  print(pipe(prompt_template)[0]['generated_text'])
@@ -200,13 +201,11 @@ print(pipe(prompt_template)[0]['generated_text'])
200
  <!-- README_GPTQ.md-compatibility start -->
201
  ## Compatibility
202
 
203
- The provided files have been tested with Transformers 4.33.0, and TGI 1.0.4.
204
-
205
- Because they are sharded, they will not yet work via AutoGPTQ. It is hoped support will be added soon.
206
 
207
- Note: lack of support for AutoGPTQ doesn't affect your ability to load these models from Python code. It only affects third-party clients that might use AutoGPTQ.
208
 
209
- [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is confirmed working as of version 1.0.4.
210
  <!-- README_GPTQ.md-compatibility end -->
211
 
212
  <!-- footer start -->
 
43
 
44
  Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
45
46
  <!-- description end -->
47
  <!-- repositories-available start -->
48
  ## Repositories available
 
86
 
87
  | Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
88
  | ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
89
+ | main | 4 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 10.00 GB | No | 4-bit, with Act Order and group size 128g. Uses even less VRAM than 64g, but with slightly lower accuracy. |
90
+ | gptq-3bit-128g-actorder_True | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 9.98 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False. |
91
+ | gptq-3bit--1g-actorder_True | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 9.93 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
92
+ | gptq-4bit--1g-actorder_True | 4 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 2048 | 92.74 GB | No | 4-bit, with Act Order. No group size, to lower VRAM requirements. |
93
 
94
  <!-- README_GPTQ.md-provided-files end -->
95
 
 
106
  <!-- README_GPTQ.md-text-generation-webui start -->
107
  ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
108
 
 
 
109
  Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
110
 
111
  It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
112
 
113
  1. Click the **Model tab**.
114
  2. Under **Download custom model or LoRA**, enter `TheBloke/Falcon-180B-GPTQ`.
115
+ - To download from a specific branch, enter for example `TheBloke/Falcon-180B-GPTQ:main`
116
  - see Provided Files above for the list of branches for each option.
117
  3. Click **Download**.
118
  4. The model will start downloading. Once it's finished it will say "Done".
119
+ 5. In the top left, click the refresh icon next to **Model**.
120
+ 6. In the **Model** dropdown, choose the model you just downloaded: `Falcon-180B-GPTQ`
121
+ 7. The model will automatically load, and is now ready for use!
122
+ 8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
 
123
  * Note that you do not need to and should not set manual GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
124
+ 9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
125
  <!-- README_GPTQ.md-text-generation-webui end -->
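
The `repo:branch` form entered in step 2 above can also be handled from Python. The following is a minimal sketch, not part of the README itself: `split_model_spec` and `download_branch` are hypothetical helper names, and the download step assumes the `huggingface_hub` package is installed. It splits a string such as `TheBloke/Falcon-180B-GPTQ:main` into a repo id and a revision, then passes both to `snapshot_download`.

```python
from typing import Tuple


def split_model_spec(spec: str, default_branch: str = "main") -> Tuple[str, str]:
    """Split 'owner/repo:branch' into (repo_id, revision).

    Hypothetical helper: falls back to default_branch when no branch is given.
    """
    repo_id, sep, branch = spec.partition(":")
    branch = branch.strip()
    return repo_id.strip(), (branch if sep and branch else default_branch)


def download_branch(spec: str) -> str:
    """Download one branch of a model repo and return the local folder path."""
    # Imported lazily so the parsing helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    repo_id, revision = split_model_spec(spec)
    return snapshot_download(repo_id=repo_id, revision=revision)
```

For example, `split_model_spec("TheBloke/Falcon-180B-GPTQ:gptq-3bit--1g-actorder_True")` yields the repo id and the branch name separately. Calling `download_branch` on the same string would fetch only that branch, which for a model of this size is a very large download.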
126
 
127
  <!-- README_GPTQ.md-use-from-python start -->
 
129
 
130
  ### Install the necessary packages
131
 
132
+ Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
133
 
134
  ```shell
135
+ pip3 install "transformers>=4.32.0" "optimum>=1.12.0"
136
  pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ # Use cu117 if on CUDA 11.7
137
  ```
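
After installing, it can help to confirm that the installed versions meet the stated minimums before loading a model this large. The sketch below is an illustration only: the helper names (`version_tuple`, `meets_minimum`, `check_requirements`) are hypothetical, the minimum versions are copied from the Requires line above, and the version parsing is deliberately simple (it compares only the leading numeric components, so a pre-release such as `4.32.0rc1` is treated as older than `4.32.0`). The `packaging` library offers a more thorough comparison if it is available.

```python
from importlib import metadata
from typing import Dict, Tuple

# Minimum versions quoted in the Requires line above.
MINIMUMS = {"transformers": "4.32.0", "optimum": "1.12.0", "auto-gptq": "0.4.2"}


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '4.33.0.dev0' or '0.4.2+cu118' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component (dev, rc, ...)
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_requirements(minimums: Dict[str, str] = MINIMUMS) -> Dict[str, str]:
    """Return {package: problem} for every package that is missing or too old."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = f"not installed (need >= {minimum})"
            continue
        if not meets_minimum(installed, minimum):
            problems[name] = f"{installed} installed (need >= {minimum})"
    return problems
```

Calling `check_requirements()` returns an empty dictionary when all three packages satisfy the minimums, and otherwise a short message per offending package.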
138
 
139
+ If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:
140
+
141
+ ```shell
142
+ pip3 uninstall -y auto-gptq
143
+ git clone https://github.com/PanQiWei/AutoGPTQ
144
+ cd AutoGPTQ
145
+ pip3 install .
146
+ ```
147
+
148
+ ### For CodeLlama models only: you must use Transformers 4.33.0 or later.
149
+
150
+ If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:
151
+ ```shell
152
+ pip3 uninstall -y transformers
153
+ pip3 install git+https://github.com/huggingface/transformers.git
154
+ ```
155
+
156
+ ### You can then use the following code
157
 
158
  ```python
159
  from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
160
 
161
  model_name_or_path = "TheBloke/Falcon-180B-GPTQ"
 
162
  # To use a different branch, change revision
163
+ # For example: revision="main"
164
  model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
165
  device_map="auto",
166
+ trust_remote_code=False,
167
  revision="main")
168
 
169
  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
170
 
171
  prompt = "Tell me about AI"
172
+ prompt_template=f'''{prompt}
173
+
174
+ '''
175
 
176
  print("\n\n*** Generate:")
177
 
178
  input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
179
+ output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
180
  print(tokenizer.decode(output[0]))
181
 
182
  # Inference can also be done using transformers' pipeline
 
187
  model=model,
188
  tokenizer=tokenizer,
189
  max_new_tokens=512,
 
190
  do_sample=True,
191
+ temperature=0.7,
192
  top_p=0.95,
193
+ top_k=40,
194
+ repetition_penalty=1.1
195
  )
196
 
197
  print(pipe(prompt_template)[0]['generated_text'])
 
201
  <!-- README_GPTQ.md-compatibility start -->
202
  ## Compatibility
203
 
204
+ The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).
 
 
205
 
206
+ [ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
207
 
208
+ [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
209
  <!-- README_GPTQ.md-compatibility end -->
210
 
211
  <!-- footer start -->