Update README.md
README.md (CHANGED)
---
language:
- en
library_name: transformers
tags:
- gpt
- llm
- large language model
- h2o-llmstudio
inference: false
thumbnail: >-
  https://h2o.ai/etc.clientlibs/h2o/clientlibs/clientlib-site/resources/images/favicon.ico
license: apache-2.0
datasets:
- OpenAssistant/oasst1
---

<!-- header start -->

These files are GPTQ 4bit model files for [H2O's GPT-GM-OASST1-Falcon 40B v2](https://huggingface.co/h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2).

It is the result of quantising to 4bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).

## Repositories available

* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GGML)
* [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2)

## Prompt template

```
<|prompt|>prompt<|endoftext|>
<|answer|>
```
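
For example, wrapping a user message in this template from Python looks like the following. This is a small illustrative helper, not part of the repository; the full example further down builds the same string inline.

```python
# Illustrative helper: wrap a user message in the prompt format above.
def build_prompt(user_message: str) -> str:
    return f"<|prompt|>{user_message}<|endoftext|><|answer|>"

print(build_prompt("Tell me about AI"))
```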

## EXPERIMENTAL

Please note that this is an experimental GPTQ model. Support for it is currently quite limited.

It is also expected to be **VERY SLOW**. This is unavoidable at the moment, but it is being looked at.

## How to download and use this model in text-generation-webui

1. Launch text-generation-webui.
2. Click the **Model tab**.
3. Untick **Autoload model**.
4. Under **Download custom model or LoRA**, enter `TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ`.
5. Click **Download**.
6. Wait until it says it's finished downloading.
7. Click the **Refresh** icon next to **Model** in the top left.
8. In the **Model** drop-down, choose the model you just downloaded, `TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ`.
9. Make sure **Loader** is set to **AutoGPTQ**. This model will not work with ExLlama or GPTQ-for-LLaMa.
10. Tick **Trust Remote Code**, then click **Save Settings**.
11. Click **Reload**.
12. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!
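
As an alternative to steps 4 to 6, the files can be fetched programmatically. Below is a minimal sketch using the `huggingface_hub` library (assumed to be installed; the local path is only an example, and for text-generation-webui the files need to end up under its `models/` directory):

```python
# Sketch: download all files from the GPTQ repo with huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ",
    local_dir="models/TheBloke_h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ",  # example path
)
print("Downloaded to:", local_dir)
```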

## How to use this GPTQ model from Python code

First install [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ). Then try the following example code:

```python
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name_or_path = "TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ"
model_basename = "gptq_model-4bit--1g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Load the quantised model; trust_remote_code is needed for Falcon's custom modelling code.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

# Note: check the prompt template is correct for this model.
prompt = "Tell me about AI"
prompt_template = f'''<|prompt|>{prompt}<|endoftext|><|answer|>'''

print("\n\n*** Generate:")
```

The provided file, `gptq_model-4bit--1g.safetensors`, was created without group_size to lower VRAM requirements, and with --act-order (desc_act) enabled.

* Works with text-generation-webui, including one-click-installers.
* Parameters: Groupsize = -1. Act Order / desc_act = True.
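
For reference, those parameters correspond to an AutoGPTQ quantisation configuration along these lines (a sketch of the settings, not the exact script used to create the files):

```python
from auto_gptq import BaseQuantizeConfig

# 4-bit quantisation, no grouping (group_size=-1), act-order / desc_act enabled,
# matching the parameters listed above.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=-1,
    desc_act=True,
)
```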

## FAQ

### About `trust_remote_code`

Please be aware that this setting causes Python code provided by Falcon to be executed on your machine.

This code is required at the moment because Falcon is too new to be supported by Hugging Face transformers. At some point in the future transformers will support the model natively, and then `trust_remote_code` will no longer be needed.

In this repo you can see two `.py` files; these are the files that get executed. They are copied from the base repo at [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).

<!-- footer start -->
## Discord