Update README.md
README.md CHANGED
@@ -20,11 +20,13 @@ This repo contains 4bit GPTQ models for GPU inference, quantised using [GPTQ-for
 
 ## PERFORMANCE ISSUES
 
-I
+For reasons I can't yet understand, there are performance problems with these 4bit GPTQs that I have not experienced with any other GPTQ 7B or 13B model.
 
-
+I have re-made the GPTQs several times, trying various versions of the GPTQ-for-LLaMa code, but I currently can't resolve the issue.
 
-
+Using the act-order.safetensors file with the Triton GPTQ-for-LLaMa code performs acceptably for me, e.g. 10-13 tokens/s testing on a 4090. But the no-act-order.safetensors file, tested with the older CUDA oobabooga GPTQ-for-LLaMa code, returns only 4 tokens/s.
+
+I will keep investigating and trying to work out what is happening here. But for the moment, if you're not able to use Triton GPTQ-for-LLaMa, you may want to try another 7B GPTQ model.
 
 ## GIBBERISH OUTPUT IN `text-generation-webui`?
 
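The tokens/s figures quoted in the new text can be checked with a simple timing loop. Below is a minimal sketch, assuming the quantised model has already been loaded into Hugging Face-style `model` and `tokenizer` objects (the loading call itself is omitted, since it differs between the Triton and CUDA GPTQ-for-LLaMa branches); `tokens_per_second` is a hypothetical helper for illustration, not part of any of the tools mentioned above:

```python
import time

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=128):
    """Generate max_new_tokens from prompt and return the generation speed.

    Hypothetical helper: assumes a transformers-style model/tokenizer pair
    already loaded from one of the quantised .safetensors files.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start

    # Count only the newly generated tokens, not the prompt tokens.
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed

# Example usage, once a model has been loaded:
# print(tokens_per_second(model, tokenizer, "Tell me about llamas."))
```

Running the same prompt against the act-order and no-act-order files, each loaded through its respective code path, should make a regression like 10-13 tokens/s versus 4 tokens/s easy to spot.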