Update README.md
README.md (CHANGED)
---
language:
- en
library_name: transformers
tags:
- gpt
- llm
- large language model
- h2o-llmstudio
inference: false
thumbnail: >-
  https://h2o.ai/etc.clientlibs/h2o/clientlibs/clientlib-site/resources/images/favicon.ico
license: apache-2.0
datasets:
- OpenAssistant/oasst1
---

<!-- header start -->

These files are GPTQ 4bit model files for [H2O's GPT-GM-OASST1-Falcon 40B v2](https://huggingface.co/h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2).

It is the result of quantising to 4bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).

## Repositories available

* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GGML)
* [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2)

## Prompt template

```
<|prompt|>prompt<|endoftext|>
<|answer|>
```
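
For example, wrapping a user message in this template from Python looks like the following. This is a small illustrative helper, not part of the repository; the full example further down builds the same string inline.

```python
# Illustrative helper: wrap a user message in the prompt format above.
def build_prompt(user_message: str) -> str:
    return f"<|prompt|>{user_message}<|endoftext|><|answer|>"

print(build_prompt("Tell me about AI"))
```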

## EXPERIMENTAL

Please note that this is an experimental GPTQ model. Support for it is currently quite limited.

It is also expected to be **VERY SLOW**. This is unavoidable at the moment, but it is being looked at.

## How to download and use this model in text-generation-webui

1. Launch text-generation-webui.
2. Click the **Model tab**.
3. Untick **Autoload model**.
4. Under **Download custom model or LoRA**, enter `TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ`.
5. Click **Download**.
6. Wait until it says it's finished downloading.
7. Click the **Refresh** icon next to **Model** in the top left.
8. In the **Model** drop-down, choose the model you just downloaded, `TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ`.
9. Make sure **Loader** is set to **AutoGPTQ**. This model will not work with ExLlama or GPTQ-for-LLaMa.
10. Tick **Trust Remote Code**, then click **Save Settings**.
11. Click **Reload**.
12. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!
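
As an alternative to steps 4 to 6, the files can be fetched programmatically. Below is a minimal sketch using the `huggingface_hub` library (assumed to be installed; the local path is only an example, and for text-generation-webui the files need to end up under its `models/` directory):

```python
# Sketch: download all files from the GPTQ repo with huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ",
    local_dir="models/TheBloke_h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ",  # example path
)
print("Downloaded to:", local_dir)
```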

## How to use this GPTQ model from Python code

First install [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ). Then try the following example code:

```python
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name_or_path = "TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2-GPTQ"
model_basename = "gptq_model-4bit--1g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Load the quantised model; trust_remote_code is needed for Falcon's custom modelling code.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

# Note: check the prompt template is correct for this model.
prompt = "Tell me about AI"
prompt_template = f'''<|prompt|>{prompt}<|endoftext|><|answer|>'''

print("\n\n*** Generate:")
```

The provided file, `gptq_model-4bit--1g.safetensors`, was created without group_size to lower VRAM requirements, and with --act-order (desc_act) enabled.

* Works with text-generation-webui, including one-click-installers.
* Parameters: Groupsize = -1. Act Order / desc_act = True.
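
For reference, those parameters correspond to an AutoGPTQ quantisation configuration along these lines (a sketch of the settings, not the exact script used to create the files):

```python
from auto_gptq import BaseQuantizeConfig

# 4-bit quantisation, no grouping (group_size=-1), act-order / desc_act enabled,
# matching the parameters listed above.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=-1,
    desc_act=True,
)
```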

## FAQ

### About `trust_remote_code`

Please be aware that this setting causes Python code provided by Falcon to be executed on your machine.

This code is required at the moment because Falcon is too new to be supported by Hugging Face transformers. At some point in the future transformers will support the model natively, and then `trust_remote_code` will no longer be needed.

In this repo you can see two `.py` files; these are the files that get executed. They are copied from the base repo at [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).

<!-- footer start -->
## Discord