Update README.md
README.md
CHANGED
@@ -17,10 +17,15 @@ It was created by merging the deltas provided in the above repo with the original
 
 It was then quantized to 4bit, groupsize 128g, using [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa).
 
-
+VRAM usage will depend on the number of tokens returned. Below approximately 1000 returned tokens it will use less than 24GB of VRAM, but at 1000+ tokens it will exceed the VRAM of a 24GB card.
 
-RAM and VRAM usage at the end of a
+RAM and VRAM usage at the end of a 670 token response in `text-generation-webui`: **5.2GB RAM, 20.7GB VRAM**
 ![Screenshot of RAM and VRAM Usage](https://i.imgur.com/Sl8SmBH.png)
+
+RAM and VRAM usage after about 1500 tokens: **5.2GB RAM, 30.0GB VRAM**
+
+![Screenshot of RAM and VRAM usage after about 1500 tokens](https://i.imgur.com/PBNtvwf.png)
+
+If you want a model that should always stay under 24GB, use this one, provided by MetaIX, instead:
+
+[GPT4 Alpaca Lora 30B GPTQ 4bit without groupsize](https://huggingface.co/MetaIX/GPT4-X-Alpaca-30B-Int4)
 
 ## Provided files
 
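To get a feel for how quickly those figures approach the limit of a 24GB card, the two measurements above can be interpolated linearly. This is only a back-of-the-envelope sketch built from the numbers reported in this README, and it assumes VRAM grows roughly linearly with the number of generated tokens; real usage also depends on prompt length, the inference code, and anything else running on the GPU:

```python
# Back-of-the-envelope check of the VRAM figures quoted above.
# Assumes VRAM grows roughly linearly with the number of generated tokens;
# the two measurements reported in this README are the only data points.

measured_gb = {
    670: 20.7,   # tokens generated -> total VRAM in GB (from the README)
    1500: 30.0,
}

(t1, v1), (t2, v2) = sorted(measured_gb.items())

# Approximate per-token VRAM growth between the two measurements.
gb_per_token = (v2 - v1) / (t2 - t1)

# Estimate how many generated tokens fit before a 24GB card is full.
budget_gb = 24.0
tokens_at_budget = t1 + (budget_gb - v1) / gb_per_token

print(f"~{gb_per_token * 1000:.1f} MB of extra VRAM per generated token")
print(f"24GB reached after roughly {tokens_at_budget:.0f} tokens")
```

By that estimate the 24GB mark is crossed a little before 1000 generated tokens, consistent with the "approximately 1000 tokens" guidance above; capping responses at around 900 new tokens should keep this 128g build inside a 24GB card.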