Thireus/Vicuna13B-v1.1-8bit-128g

@@ -6,16 +6,22 @@ tags:
 ---
 ![demo](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_08.png)
 Q. Why quantized in 8bit instead of 4bit?
 A. For evaluation purposes. In theory, an 8bit quantized model should provide slightly better perplexity (maybe not noticeable - To Be Evaluated...) than a 4bit quantized version. If your available GPU VRAM is over 15GB you may want to try this out.
 Note that quantization in 8bit does not mean loading the model in 8bit precision. Loading your model in 8bit precision (--load-in-8bit) comes with noticeable quality (perplexity) degradation.
 Refs:
 - https://github.com/ggerganov/llama.cpp/pull/951
 - https://news.ycombinator.com/item?id=35148542
 - https://arxiv.org/abs/2105.03536
-- https://github.com/IST-DASLab/gptq
 - https://arxiv.org/abs/2212.09720
 <br>
@@ -25,7 +31,7 @@ Refs:
 - wbits: 8
 - true-sequential: yes
 - act-order: yes
-- 8-bit quantized
 - Conversion process: LLaMa 13B -> LLaMa 13B HF -> Vicuna13B-v1.1 HF -> Vicuna13B-v1.1-8bit-128g
 <br>
@@ -84,7 +90,9 @@ pip install -r requirements.txt
 - i9-7980XE OC @4.6GHz
 - 11 tokens/s on average with Triton
-- Preliminary observations: better results than --load-in-8bit (To Be Confirmed)
 - Tested and working in both chat mode and text generation mode
 ![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_01.png)

 ---
 ![demo](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_08.png)
+This is an 8bit GPTQ (not to be confused with 8bit RTN) version of Vicuna 13B v1.1 HF.
 Q. Why quantized in 8bit instead of 4bit?
 A. For evaluation purposes. In theory, an 8bit quantized model should provide slightly better perplexity (maybe not noticeable - To Be Evaluated...) than a 4bit quantized version. If your available GPU VRAM is over 15GB you may want to try this out.
 Note that quantization in 8bit does not mean loading the model in 8bit precision. Loading your model in 8bit precision (--load-in-8bit) comes with noticeable quality (perplexity) degradation.
+This model is also likely to be useful only until Vicuna30B or larger models are released: an 8bit GPTQ version of those models would not fit on consumer cards, and this 13B model might give lower quality than a 4bit GPTQ version of them (To Be Evaluated).
 Refs:
 - https://github.com/ggerganov/llama.cpp/pull/951
 - https://news.ycombinator.com/item?id=35148542
+- https://github.com/ggerganov/llama.cpp/issues/53
+- https://arxiv.org/abs/2210.17323
 - https://arxiv.org/abs/2105.03536
 - https://arxiv.org/abs/2212.09720
+- https://arxiv.org/abs/2301.00774
+- https://github.com/IST-DASLab/gptq
 <br>
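The bit-width argument in the Q&A above can be illustrated with a toy experiment. The sketch below is an assumption-laden illustration, not the quantization used for this model: it applies plain per-group round-to-nearest (RTN) to synthetic Gaussian "weights" and compares the reconstruction error at 8 and 4 bits with a group size of 128. GPTQ additionally corrects for rounding error using calibration data, which this sketch does not attempt. The function names `rtn_roundtrip` and `mse` are hypothetical helpers.

```python
import random


def rtn_roundtrip(weights, bits, group_size=128):
    """Quantize with per-group asymmetric round-to-nearest, then dequantize."""
    qmax = 2 ** bits - 1  # highest integer level, e.g. 255 for 8 bits
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax
        if scale == 0:
            scale = 1.0  # constant group: avoid division by zero
        for w in group:
            q = min(qmax, max(0, round((w - lo) / scale)))
            out.append(q * scale + lo)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Synthetic stand-in for a weight tensor (assumption: roughly Gaussian values).
weights = [random.gauss(0.0, 0.02) for _ in range(128 * 64)]

err8 = mse(weights, rtn_roundtrip(weights, 8))
err4 = mse(weights, rtn_roundtrip(weights, 4))
print(f"8bit RTN MSE: {err8:.3e}")
print(f"4bit RTN MSE: {err4:.3e}")
```

Lower reconstruction error on the weights does not by itself guarantee a noticeable perplexity gain, which is why the comparison above is still marked To Be Evaluated.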
 - wbits: 8
 - true-sequential: yes
 - act-order: yes
+- 8-bit GPTQ
 - Conversion process: LLaMa 13B -> LLaMa 13B HF -> Vicuna13B-v1.1 HF -> Vicuna13B-v1.1-8bit-128g
 <br>
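As a rough sanity check on the VRAM guidance, the parameters listed above (wbits 8, group size 128) allow a back-of-the-envelope estimate of the weight footprint. This is a sketch under stated assumptions: about 13e9 parameters, one 16-bit scale and one `wbits`-wide zero point stored per group, and every weight quantized. It ignores unquantized layers, activations, and the attention cache, so real usage will be higher. The helper `quantized_weight_gib` is hypothetical.

```python
def quantized_weight_gib(n_params, wbits, group_size, scale_bits=16):
    """Approximate size in GiB of group-quantized weights.

    Assumes each group of `group_size` weights stores one scale
    (`scale_bits` wide) and one zero point (`wbits` wide).
    """
    overhead_bits = (scale_bits + wbits) / group_size
    bits_per_weight = wbits + overhead_bits
    return n_params * bits_per_weight / 8 / 2 ** 30


N_PARAMS = 13e9  # assumption: approximate parameter count of a 13B model

gib8 = quantized_weight_gib(N_PARAMS, wbits=8, group_size=128)
gib4 = quantized_weight_gib(N_PARAMS, wbits=4, group_size=128)
print(f"8bit, 128g: ~{gib8:.1f} GiB of weights")
print(f"4bit, 128g: ~{gib4:.1f} GiB of weights")
```

The gap between this weights-only figure and the 15GB guidance is the headroom needed for everything the estimate leaves out.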
 - i9-7980XE OC @4.6GHz
 - 11 tokens/s on average with Triton
+- Tokens/s observed to be equivalent to the 4bit version
+- Preliminary observation: better quality results than 8bit RTN (--load-in-8bit) (To Be Confirmed)
+- Preliminary observation: slightly better quality results than 4bit GPTQ (To Be Confirmed)
 - Tested and working in both chat mode and text generation mode
 ![screenshot](https://thireus.com/AI/Thireus_Vicuna13B-v1.1-8bit-128g_01.png)