smpanaro committed e0b038d (1 parent: c2fa56c)

Create README.md

Files changed (1):
  1. README.md +114 -0
README.md ADDED
@@ -0,0 +1,114 @@
---
datasets:
- wikitext
metrics:
- perplexity
---
**N**on-**u**niform **GPTQ** (NuGPTQ) combines [GPTQ](https://arxiv.org/abs/2210.17323), [SqueezeLLM](https://arxiv.org/abs/2306.07629), and [output scaling](https://stephenpanaro.com/blog/llm-quantization-for-iphone) into a competitive whole-tensor (no grouping) LLM compression method.

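At a high level, the three ingredients slot together like this: a SqueezeLLM-style non-uniform codebook (k-means over the whole tensor) replaces the uniform grid, GPTQ's column-by-column error feedback decides how each column is rounded to that codebook, and a per-output-channel scale is pulled out of the weight and applied to the layer's output instead of using groups. The sketch below only illustrates that composition on a toy layer; it is not the code used to produce this checkpoint, and all names (`fit_codebook`, `quantize_to_codebook`, `nugptq_quantize_`) are invented for illustration.

```python
import torch

def fit_codebook(weight, importance, n_levels=16, iters=25):
    """Weighted k-means over every value in the tensor -> one shared codebook (SqueezeLLM-style)."""
    values = weight.flatten().float()
    w = importance.flatten().float().clamp_min(1e-12)
    centroids = torch.linspace(values.min().item(), values.max().item(), n_levels)
    for _ in range(iters):
        assign = (values[:, None] - centroids[None, :]).abs().argmin(dim=1)
        for k in range(n_levels):
            mask = assign == k
            if mask.any():
                centroids[k] = (w[mask] * values[mask]).sum() / w[mask].sum()
    return centroids

def quantize_to_codebook(x, codebook):
    """Snap every entry of x to its nearest codebook value."""
    idx = (x[..., None] - codebook).abs().argmin(dim=-1)
    return codebook.to(x.dtype)[idx]

def nugptq_quantize_(weight, hessian, n_levels=16):
    """Toy whole-tensor quantization: output scaling + shared codebook + GPTQ error feedback.

    weight:  (out_features, in_features), modified in place
    hessian: (in_features, in_features), roughly X^T X from calibration activations
    Returns a per-output-channel scale meant to be applied to the layer's output.
    """
    # Output scaling: pull a per-row scale out of the weight instead of using groups.
    scale = weight.abs().mean(dim=1, keepdim=True).clamp_min(1e-8)
    weight /= scale

    # GPTQ uses the upper Cholesky factor of the damped inverse Hessian for error feedback.
    eye = torch.eye(hessian.shape[0], dtype=hessian.dtype)
    damp = 0.01 * hessian.diagonal().mean()
    h_inv = torch.linalg.cholesky(torch.linalg.inv(hessian + damp * eye), upper=True)

    # One non-uniform codebook for the whole tensor. (Uniform importance here;
    # SqueezeLLM weights values by their sensitivity.)
    codebook = fit_codebook(weight, importance=torch.ones_like(weight), n_levels=n_levels)

    for j in range(weight.shape[1]):
        col = weight[:, j]
        q = quantize_to_codebook(col, codebook)
        err = (col - q) / h_inv[j, j]
        weight[:, j] = q
        # Push the rounding error onto the columns that are not yet quantized.
        weight[:, j + 1:] -= err[:, None] * h_inv[j, j + 1:][None, :]

    return scale  # e.g. y = (x @ weight.T) * scale.T + bias

# Toy usage: a random "layer" and a random calibration Hessian.
torch.manual_seed(0)
W = torch.randn(32, 64)
X = torch.randn(256, 64)          # stand-in calibration activations
scale = nugptq_quantize_(W, hessian=X.T @ X)
print(W.unique().numel(), "unique values")  # <= 16
```
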
Results for Llama-2-7b-hf:

|Method |Wikitext PPL (↓)|Delta |
|-- |-- |-- |
|float16 |8.7071 |0 |
|AWQ |8.9760 |0.2689|
|NuGPTQ (This)|9.2754 |0.5683|
|GPTQ† |9.4686 |0.7615|

<sub>† g128, desc_act=True</sub>

<details>
<summary>perplexity reproduction steps</summary>

```shell
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
pip install optimum

huggingface-cli login

# Set batch size based on your GPU.
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype="float16" \
  --tasks wikitext \
  --batch_size 1

# hf (pretrained=meta-llama/Llama-2-7b-hf,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|8.7071|± |N/A |
# | | |none | 0|byte_perplexity|1.4989|± |N/A |
# | | |none | 0|bits_per_byte |0.5839|± |N/A |

lm_eval --model hf \
  --model_args pretrained=smpanaro/Llama-2-7b-NuGPTQ,dtype="float16",use_safetensors=True,trust_remote_code=True \
  --tasks wikitext \
  --batch_size 1

# hf (pretrained=smpanaro/Llama-2-7b-NuGPTQ,dtype=float16,use_safetensors=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|9.2754|± |N/A |
# | | |none | 0|byte_perplexity|1.5167|± |N/A |
# | | |none | 0|bits_per_byte |0.6009|± |N/A |

pip install auto-gptq
lm_eval --model hf \
  --model_args pretrained=TheBloke/Llama-2-7B-GPTQ,dtype="float16",revision=gptq-4bit-128g-actorder_True \
  --tasks wikitext \
  --batch_size 1

# hf (pretrained=TheBloke/Llama-2-7B-GPTQ,dtype=float16,revision=gptq-4bit-128g-actorder_True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|9.4686|± |N/A |
# | | |none | 0|byte_perplexity|1.5225|± |N/A |
# | | |none | 0|bits_per_byte |0.6065|± |N/A |

lm_eval --model hf \
  --model_args pretrained=TheBloke/Llama-2-7B-GPTQ,dtype="float16",revision=gptq-4bit-32g-actorder_True \
  --tasks wikitext \
  --batch_size 1

# hf (pretrained=TheBloke/Llama-2-7B-GPTQ,dtype=float16,revision=gptq-4bit-32g-actorder_True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|9.3801|± |N/A |
# | | |none | 0|byte_perplexity|1.5199|± |N/A |
# | | |none | 0|bits_per_byte |0.6040|± |N/A |

pip install autoawq
lm_eval --model hf \
  --model_args pretrained=TheBloke/Llama-2-7B-AWQ,dtype="float16" \
  --tasks wikitext \
  --batch_size 1

# hf (pretrained=TheBloke/Llama-2-7B-AWQ,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
# | Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
# |--------|------:|------|-----:|---------------|-----:|---|------|
# |wikitext| 2|none | 0|word_perplexity|8.9760|± |N/A |
# | | |none | 0|byte_perplexity|1.5074|± |N/A |
# | | |none | 0|bits_per_byte |0.5921|± |N/A |
```

</details>

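A quick consistency check on the logs above: for each run, lm-evaluation-harness reports both `byte_perplexity` and `bits_per_byte`, which are derived from the same summed log-likelihood and should be related by `byte_perplexity = 2 ** bits_per_byte`. The values copied from the collapsed section satisfy this:

```python
# byte_perplexity and bits_per_byte pairs from the lm_eval tables above.
runs = {
    "float16":   (1.4989, 0.5839),
    "NuGPTQ":    (1.5167, 0.6009),
    "GPTQ-128g": (1.5225, 0.6065),
    "GPTQ-32g":  (1.5199, 0.6040),
    "AWQ":       (1.5074, 0.5921),
}
for name, (byte_ppl, bits_per_byte) in runs.items():
    assert abs(byte_ppl - 2 ** bits_per_byte) < 1e-3, name
    print(f"{name}: 2**{bits_per_byte} = {2 ** bits_per_byte:.4f} ~ {byte_ppl}")
```
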
The model is fake quantized, which means each of the quantized linear weight tensors has at most 16 (2<sup>4</sup>) unique values, but those values are stored in float16. The uniqueness can be checked as follows:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("smpanaro/Llama-2-7b-NuGPTQ")

# The seven linear projections in each decoder layer are quantized.
linear_layers = ["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
count = 0
for key, tensor in model.state_dict().items():
    if "weight" not in key:
        continue
    if any(l in key for l in linear_layers):
        assert tensor.unique().shape[0] <= 16, f"{key} has more than 16 unique values"
        print("✓", end="", flush=True)
        count += 1

print()
# 32 model layers * 7 linear layers
print(f"{count} out of 224 linear layers have at most 16 unique values.")
```
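
Because the tensors are only fake quantized, the checkpoint itself is still full float16 size. To get real 4-bit storage you would recover each tensor's (at most) 16 values and store 4-bit indices into that codebook instead. The snippet below is a hypothetical illustration of that packing, not a format this repository provides; `pack_4bit` and `unpack_4bit` are invented names.

```python
import torch

def pack_4bit(weight):
    """Recover the per-tensor codebook from a fake-quantized tensor and store two 4-bit indices per byte."""
    codebook = weight.unique()                 # <= 16 distinct float16 values
    assert codebook.numel() <= 16
    flat = weight.flatten()
    idx = (flat[:, None] == codebook[None, :]).to(torch.uint8).argmax(dim=1).to(torch.uint8)
    if idx.numel() % 2:                        # pad to an even count of indices
        idx = torch.cat([idx, idx.new_zeros(1)])
    packed = (idx[0::2] << 4) | idx[1::2]
    return packed, codebook, weight.shape

def unpack_4bit(packed, codebook, shape):
    """Reverse of pack_4bit: split each byte back into two indices and look them up."""
    hi, lo = packed >> 4, packed & 0xF
    idx = torch.stack([hi, lo], dim=1).flatten()[: shape.numel()].long()
    return codebook[idx].reshape(shape)

# Round-trip check on a toy tensor that has (up to) 16 unique float16 values.
levels = torch.randn(16).to(torch.float16)
w = levels[torch.randint(0, 16, (64, 128))]
packed, codebook, shape = pack_4bit(w)
assert torch.equal(unpack_4bit(packed, codebook, shape), w)
print(f"{w.numel() * 2} float16 bytes -> {packed.numel()} packed bytes + {codebook.numel() * 2} codebook bytes")
```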