TheBloke committed
Commit 5109fbb
1 Parent(s): bce43c2

Initial GPTQ model upload

README.md CHANGED
@@ -1,12 +1,6 @@
 ---
 inference: false
-license: cc
-datasets:
-- VMware/open-instruct-v1-oasst-dolly-hhrlhf
-language:
-- en
-library_name: transformers
-pipeline_tag: text-generation
+license: other
 ---
 
 <!-- header start -->
@@ -27,7 +21,7 @@ pipeline_tag: text-generation
 
 These files are GPTQ 4bit model files for [VMWare's open-llama-7B-open-instruct](https://huggingface.co/VMware/open-llama-7b-open-instruct).
 
-It is the result of quantising to 4bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).
+It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).
 
 ## Repositories available
 
@@ -35,15 +29,6 @@ It is the result of quantising to 4bit using [AutoGPTQ](https://github.com/PanQi
 * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/open-llama-7b-open-instruct-GGML)
 * [Unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/VMware/open-llama-7b-open-instruct)
 
-## Prompt template
-
-```
-Below is an instruction that describes a task. Write a response that appropriately completes the request
-
-### Instruction: prompt
-### Response:
-```
-
 ## How to easily download and use this model in text-generation-webui
 
 Please make sure you're using the latest version of text-generation-webui
@@ -73,22 +58,20 @@ from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
 import argparse
 
 model_name_or_path = "TheBloke/open-llama-7b-open-instruct-GPTQ"
+model_basename = "open-llama-7B-open-instruct-GPTQ-4bit-128g.no-act.order"
 
 use_triton = False
 
 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
 
 model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
+        model_basename=model_basename,
         use_safetensors=True,
         trust_remote_code=True,
         device="cuda:0",
         use_triton=use_triton,
         quantize_config=None)
 
-prompt = "Tell me about AI"
-prompt_template=f'''### Instruction: {prompt}
-### Response:'''
-
 print("\n\n*** Generate:")
 
 input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
@@ -100,6 +83,10 @@ print(tokenizer.decode(output[0]))
 # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
 logging.set_verbosity(logging.CRITICAL)
 
+prompt = "Tell me about AI"
+prompt_template=f'''### Human: {prompt}
+### Assistant:'''
+
 print("*** Pipeline:")
 pipe = pipeline(
     "text-generation",
@@ -116,14 +103,15 @@ print(pipe(prompt_template)[0]['generated_text'])
 
 ## Provided files
 
-**gptq_model-4bit-128g.safetensors**
+**open-llama-7B-open-instruct-GPTQ-4bit-128g.no-act.order.safetensors**
 
-This is tested to work with AutoGPTQ. It may also work with GPTQ-for-LLaMa but this is untested.
+This will work with AutoGPTQ and CUDA versions of GPTQ-for-LLaMa. There are reports of issues with Triton mode of recent GPTQ-for-LLaMa. If you have issues, please use AutoGPTQ instead.
 
 It was created with group_size 128 to increase inference accuracy, but without --act-order (desc_act) to increase compatibility and improve inference speed.
 
-* `gptq_model-4bit-128g.safetensors`
+* `open-llama-7B-open-instruct-GPTQ-4bit-128g.no-act.order.safetensors`
   * Works with AutoGPTQ in CUDA or Triton modes.
+  * Works with GPTQ-for-LLaMa in CUDA mode. May have issues with GPTQ-for-LLaMa Triton mode.
   * Works with text-generation-webui, including one-click-installers.
   * Parameters: Groupsize = 128. Act Order / desc_act = False.
 
 
config.json ADDED
@@ -0,0 +1,23 @@
+{
+  "_name_or_path": "/home/gollapudit/peft/open_llama_open_instruct_v1.1",
+  "architectures": [
+    "LlamaForCausalLM"
+  ],
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "hidden_act": "silu",
+  "hidden_size": 4096,
+  "initializer_range": 0.02,
+  "intermediate_size": 11008,
+  "max_position_embeddings": 2048,
+  "model_type": "llama",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 32,
+  "pad_token_id": 0,
+  "rms_norm_eps": 1e-06,
+  "tie_word_embeddings": false,
+  "torch_dtype": "float16",
+  "transformers_version": "4.28.1",
+  "use_cache": true,
+  "vocab_size": 32000
+}
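The added config.json describes a standard 7B LLaMA-architecture model: 32 layers, 32 attention heads, hidden size 4096, 2048-token context. A quick way to read those fields back after download, assuming `transformers` is installed (a sketch, not part of the commit):

```python
from transformers import AutoConfig

# Values should match the config.json shown above
config = AutoConfig.from_pretrained("TheBloke/open-llama-7b-open-instruct-GPTQ")
print(config.model_type)               # "llama"
print(config.hidden_size)              # 4096
print(config.num_hidden_layers)        # 32
print(config.num_attention_heads)      # 32
print(config.max_position_embeddings)  # 2048
```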
generation_config.json ADDED
@@ -0,0 +1,7 @@
+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "pad_token_id": 0,
+  "transformers_version": "4.28.1"
+}
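generation_config.json only pins the special token ids used at generation time. If useful, it can be inspected the same way, assuming a recent `transformers` (illustrative sketch):

```python
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("TheBloke/open-llama-7b-open-instruct-GPTQ")
print(gen_config.bos_token_id, gen_config.eos_token_id, gen_config.pad_token_id)  # 1 2 0
```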
gptq_model-4bit-128g.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d3376ab3dc59deb38d15e566f93093a0a8e46e3362ff937a394176ab6f2e7dd3
+size 3896726080
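The entry above is a Git LFS pointer, not the weights themselves: the ~3.9 GB `gptq_model-4bit-128g.safetensors` file is fetched separately and identified by the sha256 shown. A sketch for checking a local download against that hash (the file path is an example, adjust to wherever the file was saved):

```python
import hashlib

EXPECTED_SHA256 = "d3376ab3dc59deb38d15e566f93093a0a8e46e3362ff937a394176ab6f2e7dd3"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so the 3.9 GB model never sits fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

actual = sha256_of("gptq_model-4bit-128g.safetensors")
assert actual == EXPECTED_SHA256, f"hash mismatch: {actual}"
```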
quantize_config.json ADDED
@@ -0,0 +1,10 @@
+{
+  "bits": 4,
+  "group_size": 128,
+  "damp_percent": 0.01,
+  "desc_act": false,
+  "sym": true,
+  "true_sequential": true,
+  "model_name_or_path": null,
+  "model_file_base_name": null
+}
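These are the GPTQ settings the README describes: 4-bit, group size 128, act-order (`desc_act`) disabled. AutoGPTQ picks this file up automatically when `quantize_config=None` is passed to `from_quantized`; the explicit equivalent, as a sketch assuming `auto_gptq` is installed, would look like:

```python
from auto_gptq import BaseQuantizeConfig

# Mirrors quantize_config.json above; normally quantize_config=None lets
# AutoGPTQ load this file from the repo instead.
quantize_config = BaseQuantizeConfig(
    bits=4,            # 4-bit quantisation
    group_size=128,    # groupsize 128 for better inference accuracy
    desc_act=False,    # no act-order, for compatibility and speed
    damp_percent=0.01,
    sym=True,
    true_sequential=True,
)
```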
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<unk>",
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ab1b681ec7fc02fed5edd3026687d7a692a918c4dd8e150ca2e3994a6229843b
+size 534194
tokenizer_config.json ADDED
@@ -0,0 +1,34 @@
+{
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "bos_token": {
+    "__type": "AddedToken",
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "clean_up_tokenization_spaces": false,
+  "eos_token": {
+    "__type": "AddedToken",
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "model_max_length": 2048,
+  "pad_token": null,
+  "padding_side": "right",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "LlamaTokenizer",
+  "unk_token": {
+    "__type": "AddedToken",
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
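The tokenizer files added above (special_tokens_map.json, tokenizer.json, tokenizer.model, tokenizer_config.json) define a standard LlamaTokenizer with a 2048-token `model_max_length`, `<s>`/`</s>` as BOS/EOS, and `<unk>` mapped to the pad token in special_tokens_map.json. A short sketch to load it and confirm those values, assuming `transformers` and `sentencepiece` are installed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/open-llama-7b-open-instruct-GPTQ",
                                          use_fast=True)
print(type(tokenizer).__name__)   # a (fast) LlamaTokenizer
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token)  # <s> </s> <unk>
print(tokenizer.model_max_length) # 2048

# Round-trip a prompt in the repo's "### Human / ### Assistant" format
ids = tokenizer("### Human: Tell me about AI\n### Assistant:", return_tensors="pt").input_ids
print(ids.shape)
```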