Initial GPTQ model upload

Files changed:
- README.md +12 -24
- config.json +23 -0
- generation_config.json +7 -0
- gptq_model-4bit-128g.safetensors +3 -0
- quantize_config.json +10 -0
- special_tokens_map.json +24 -0
- tokenizer.json +0 -0
- tokenizer.model +3 -0
- tokenizer_config.json +34 -0
README.md
CHANGED
@@ -1,12 +1,6 @@
 ---
 inference: false
-license:
-datasets:
-- VMware/open-instruct-v1-oasst-dolly-hhrlhf
-language:
-- en
-library_name: transformers
-pipeline_tag: text-generation
+license: other
 ---
 
 <!-- header start -->
@@ -27,7 +21,7 @@ pipeline_tag: text-generation
 
 These files are GPTQ 4bit model files for [VMWare's open-llama-7B-open-instruct](https://huggingface.co/VMware/open-llama-7b-open-instruct).
 
-It is the result of quantising to 4bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).
+It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).
 
 ## Repositories available
 
@@ -35,15 +29,6 @@ It is the result of quantising to 4bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).
 * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/open-llama-7b-open-instruct-GGML)
 * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/VMware/open-llama-7b-open-instruct)
 
-## Prompt template
-
-```
-Below is an instruction that describes a task. Write a response that appropriately completes the request
-
-### Instruction: prompt
-### Response:
-```
-
 ## How to easily download and use this model in text-generation-webui
 
 Please make sure you're using the latest version of text-generation-webui
@@ -73,22 +58,20 @@ from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
 import argparse
 
 model_name_or_path = "TheBloke/open-llama-7b-open-instruct-GPTQ"
+model_basename = "open-llama-7B-open-instruct-GPTQ-4bit-128g.no-act.order"
 
 use_triton = False
 
 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
 
 model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
+        model_basename=model_basename,
         use_safetensors=True,
         trust_remote_code=True,
         device="cuda:0",
         use_triton=use_triton,
         quantize_config=None)
 
-prompt = "Tell me about AI"
-prompt_template=f'''### Instruction: {prompt}
-### Response:'''
-
 print("\n\n*** Generate:")
 
 input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
@@ -100,6 +83,10 @@ print(tokenizer.decode(output[0]))
 # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
 logging.set_verbosity(logging.CRITICAL)
 
+prompt = "Tell me about AI"
+prompt_template=f'''### Human: {prompt}
+### Assistant:'''
+
 print("*** Pipeline:")
 pipe = pipeline(
     "text-generation",
@@ -116,14 +103,15 @@ print(pipe(prompt_template)[0]['generated_text'])
 
 ## Provided files
 
-**
+**open-llama-7B-open-instruct-GPTQ-4bit-128g.no-act.order.safetensors**
 
-This
+This will work with AutoGPTQ and CUDA versions of GPTQ-for-LLaMa. There are reports of issues with Triton mode of recent GPTQ-for-LLaMa. If you have issues, please use AutoGPTQ instead.
 
 It was created with group_size 128 to increase inference accuracy, but without --act-order (desc_act) to increase compatibility and improve inference speed.
 
-* `
+* `open-llama-7B-open-instruct-GPTQ-4bit-128g.no-act.order.safetensors`
 * Works with AutoGPTQ in CUDA or Triton modes.
+* Works with GPTQ-for-LLaMa in CUDA mode. May have issues with GPTQ-for-LLaMa Triton mode.
 * Works with text-generation-webui, including one-click-installers.
 * Parameters: Groupsize = 128. Act Order / desc_act = False.
 
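
For reference, the loading snippet shown in the README hunks above can be assembled into one runnable script. The sketch below does that, assuming the auto-gptq and transformers packages are installed; it keeps the model_basename and Human/Assistant prompt template added in this commit, while the generation settings (max_new_tokens, temperature) are illustrative assumptions rather than values taken from the diff.

```python
# Minimal usage sketch assembled from the README diff above (not part of the commit itself).
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/open-llama-7b-open-instruct-GPTQ"
model_basename = "open-llama-7B-open-instruct-GPTQ-4bit-128g.no-act.order"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=False,
    quantize_config=None,
)

# Prompt template from the updated README (### Human / ### Assistant).
prompt = "Tell me about AI"
prompt_template = f"### Human: {prompt}\n### Assistant:"

# Direct generate() call, as in the README snippet.
input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids.cuda()
output = model.generate(input_ids=input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0]))

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ,
# then run the same prompt through a text-generation pipeline.
logging.set_verbosity(logging.CRITICAL)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)
print(pipe(prompt_template)[0]["generated_text"])
```
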
config.json
ADDED
@@ -0,0 +1,23 @@
+{
+  "_name_or_path": "/home/gollapudit/peft/open_llama_open_instruct_v1.1",
+  "architectures": [
+    "LlamaForCausalLM"
+  ],
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "hidden_act": "silu",
+  "hidden_size": 4096,
+  "initializer_range": 0.02,
+  "intermediate_size": 11008,
+  "max_position_embeddings": 2048,
+  "model_type": "llama",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 32,
+  "pad_token_id": 0,
+  "rms_norm_eps": 1e-06,
+  "tie_word_embeddings": false,
+  "torch_dtype": "float16",
+  "transformers_version": "4.28.1",
+  "use_cache": true,
+  "vocab_size": 32000
+}
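
The added config.json describes a standard 7B LLaMA-architecture model. As a quick sanity check (not part of this commit, assuming the transformers library is available), it can be loaded and inspected like so:

```python
# Sketch: read the uploaded config.json via transformers and echo key fields.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("TheBloke/open-llama-7b-open-instruct-GPTQ")
print(config.model_type)                # "llama"
print(config.hidden_size)               # 4096
print(config.num_hidden_layers)         # 32
print(config.max_position_embeddings)   # 2048
```
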
generation_config.json
ADDED
@@ -0,0 +1,7 @@
+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "pad_token_id": 0,
+  "transformers_version": "4.28.1"
+}
gptq_model-4bit-128g.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d3376ab3dc59deb38d15e566f93093a0a8e46e3362ff937a394176ab6f2e7dd3
+size 3896726080
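
This entry is a Git LFS pointer, not the weights themselves. After downloading the roughly 3.9 GB safetensors file it points to, the copy can be verified against the recorded sha256 oid, for example with the sketch below (the local filename and path are assumptions):

```python
# Sketch: verify a downloaded gptq_model-4bit-128g.safetensors against the LFS oid above.
import hashlib

EXPECTED_SHA256 = "d3376ab3dc59deb38d15e566f93093a0a8e46e3362ff937a394176ab6f2e7dd3"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large weights do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

assert sha256_of("gptq_model-4bit-128g.safetensors") == EXPECTED_SHA256
print("checksum OK")
```
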
quantize_config.json
ADDED
@@ -0,0 +1,10 @@
+{
+  "bits": 4,
+  "group_size": 128,
+  "damp_percent": 0.01,
+  "desc_act": false,
+  "sym": true,
+  "true_sequential": true,
+  "model_name_or_path": null,
+  "model_file_base_name": null
+}
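
For reference, these settings mirror the arguments AutoGPTQ takes when quantising a model. A hedged sketch of the equivalent BaseQuantizeConfig (an illustration of how the file maps to the library, not the exact command used for this upload) is:

```python
# Sketch: quantize_config.json expressed as an auto-gptq BaseQuantizeConfig.
# 4-bit, group_size 128, no act-order (desc_act=False), matching the README's notes.
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.01,
    desc_act=False,
    sym=True,
    true_sequential=True,
)
```
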
special_tokens_map.json
ADDED
@@ -0,0 +1,24 @@
+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<unk>",
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json
ADDED
The diff for this file is too large to render; see the raw file in the repository.
tokenizer.model
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ab1b681ec7fc02fed5edd3026687d7a692a918c4dd8e150ca2e3994a6229843b
+size 534194
tokenizer_config.json
ADDED
@@ -0,0 +1,34 @@
+{
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "bos_token": {
+    "__type": "AddedToken",
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "clean_up_tokenization_spaces": false,
+  "eos_token": {
+    "__type": "AddedToken",
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "model_max_length": 2048,
+  "pad_token": null,
+  "padding_side": "right",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "LlamaTokenizer",
+  "unk_token": {
+    "__type": "AddedToken",
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
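
Together with special_tokens_map.json and the tokenizer files above, this should load as a standard LLaMA tokenizer. A quick check (not part of this commit, assuming transformers is installed) is:

```python
# Sketch: confirm the uploaded tokenizer files load with the declared special tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TheBloke/open-llama-7b-open-instruct-GPTQ", use_fast=True)
print(tok.bos_token, tok.eos_token, tok.unk_token)  # <s> </s> <unk>
print(tok.model_max_length)                         # 2048
```
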