New 8-bit quant method, how is it performing on your CPU? (share your tokens/s, CPU model and --threads setting)

#2
by alphaprime90 - opened

Thank you TheBloke for the various quants options. How is 8bit treating you?

In my case, 8-bit is actually around 30% faster than 4- or 5-bit, but it uses more RAM.
Windows 11, Ryzen 5900X

Wow faster? That's surprising.

I've not tested 8-bit at all yet, to be honest. I just threw it in there for completeness. I'll give it a go later to see how it performs.

In my case, 8-bit is actually around 30% faster than 4- or 5-bit, but it uses more RAM.
Windows 11, Ryzen 5900X

Sir, may I know how much RAM it consumes? I have 16 GB; will that be enough? Thanks!

I have a question: are q4_0 models faster than q5_0 models?

In my case, 8-bit is actually around 30% faster than 4- or 5-bit, but it uses more RAM.
Windows 11, Ryzen 5900X

Nice. I have a question: which is faster, 4-bit or 5-bit?

Unfortunately, I was trying different versions of llama-cpp-python and llama.cpp (mainly because of the new GGML format and GPU offloading) and I am no longer able to reproduce it on my main computer; now q8_0 is slower. But on my laptop I still have an older version of the oobabooga webui, and there q8_0 is faster than q5_1. Even the llama.cpp README says q8_0 is faster in some cases. One nice thing about q8_0, though: unlike q5_1 (I am not sure about the other types), even the older q8_0 models still work with the newest llama.cpp.
pip list: https://pastebin.com/sLkgSe2s
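For anyone trying to reproduce timings across versions, a rough sketch of pinning the Python binding so runs stay comparable (X.Y.Z is just a placeholder, not a version taken from the pastebin above):

pip install llama-cpp-python==X.Y.Z
pip freeze > requirements-benchmark.txt

Recording the environment next to the measured tokens/s makes it easier to tell whether a speed change came from the model or from the library version.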

Nice! Thanks for the explanation.
I am also glad that this model is working with my old laptop. I will keep on using that version.
This new Wiz-Vic 13B Uncensored model is really amazing.

I'm running on a Dell R520 (2 x Xeon E5-2470 v2 @ 2.40 GHz, 40 threads in total). However, I installed Proxmox and created a VM just for generative AI experiments.

These are the VM specs:

  • 32GB RAM
  • 16 vCPUs (threads) with only AVX enabled; no AVX2, AVX-512, etc. (see the quick check below)
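If you want to confirm which AVX variants the hypervisor actually exposes to the guest, a quick check inside a Linux VM is:

grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u

This just lists the unique AVX-related CPU flags the guest can see.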

I'm running q8_0 with these args:

--cpu-memory 20
--threads 16 
--wbits 8 
--groupsize 128 
--cpu 
--listen 
--chat 
--verbose 
--no-stream 
--extensions long_term_memory sd_api_pictures send_pictures 
--model Wizard-Vicuna-13B-Uncensored.ggml.q8_0

Normally I get around 1.0-1.2 tokens/s (roughly 4-5 minutes for 200 tokens).
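As a single command that would be roughly the following, assuming the standard text-generation-webui entry point server.py (the flag set itself is unchanged from the list above):

python server.py --cpu-memory 20 --threads 16 --wbits 8 --groupsize 128 --cpu --listen --chat --verbose --no-stream --extensions long_term_memory sd_api_pictures send_pictures --model Wizard-Vicuna-13B-Uncensored.ggml.q8_0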

Seems like AVX2 makes a bit of a difference.
With 13B models, I get around 360ms per token with q8_0, on only 6 threads, on my budget R7 5700G.
Although I still prefer q4_0 for that extra speed (200ms per token).
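(For reference, the conversion is just tokens/s = 1000 / ms-per-token, so 360 ms/token is roughly 2.8 tokens/s and 200 ms/token is 5 tokens/s.)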

With an i9 9900 and q5_1 I get 312 ms/token, but with -ngl 20 I'm getting 190 ms/token :)

I can't get -ngl to work for the life of me, keep getting out of memory... :(

Interesting, which tool are you using to run this model? I'm using oobabooga but don't see that flag.

I'm using llama.cpp with that -ngl flag.
I have only an 8 GB card (a 3060), so for a 13B model I can fit only 20 layers.
But for 7B models I can fit all the layers (32) on my GPU, and the models run about 3x faster: 180 ms vs 58 ms per token.
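A minimal llama.cpp invocation with partial offload would look something like this (the model filename here is just an example; -m, -t, -ngl, -n and -p are existing llama.cpp flags):

./main -m models/Wizard-Vicuna-13B-Uncensored.ggml.q8_0.bin -t 8 -ngl 20 -n 256 -p "Hello"

Raising -ngl until the model no longer fits in VRAM is the usual way to find the sweet spot.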

The newest kobold supports it as well, but that parameter has a different name.

I think it's time to buy a 3090 with 24 GB, as they are very cheap now; second-hand I can get one for 600 euros :D

@mirek190 you can put 33 layers into vram (+1 output layer), might get even faster :)

I can't get -ngl to work for the life of me, keep getting out of memory... :(

If you are using oobabooga, it could be because llama.cpp doesn't free VRAM; I always have to close the app completely and start it again to free the VRAM after I use ngl.

@mirek190 you can put 33 layers into vram (+1 output layer), might get even faster :)

Omg... you're right... adding one extra layer gives even better speed 😄

I can't get -ngl to work for the life of me, keep getting out of memory... :(

If you are using oobabooga, it could be because llama.cpp doesn't free VRAM; I always have to close the app completely and start it again to free the VRAM after I use ngl.

I'm actually using https://github.com/ggerganov/llama.cpp directly, have not tried this with oobabooga.

I have a 3090 here and with 24GB I'd think any 13B 4bit model should be working without any issues. Yet no matter the number of layers I set, I always get OOM. It works without it, but then what's the point, no? :)

There was some issue logged in the llama.cpp repo about this, but they marked it as resolved. Guess it hasn't been truly resolved, and I'm stumped.

Strange... for me -ngl works even with the 65B model, but I can fit only 7 layers on an 8 GB card :)

For me, the number of layers I can fit with -ngl looks like this:
7B - 32
13B - 25
30B - 14
65B - 7

I can't get -ngl to work for the life of me, keep getting out of memory... :(

If you are using oobabooga, it could be because llama.cpp doesn't free VRAM; I always have to close the app completely and start it again to free the VRAM after I use ngl.

I'm actually using https://github.com/ggerganov/llama.cpp directly, have not tried this with oobabooga.

I have a 3090 here and with 24GB I'd think any 13B 4bit model should be working without any issues. Yet no matter the number of layers I set, I always get OOM. It works without it, but then what's the point, no? :)

There was some issue logged in the llama.cpp repo about this, but they marked it as resolved. Guess it hasn't been truly resolved, and I'm stumped.

I see, I haven't tried it directly. I am using oobabooga and it works well for me. The only downside is that I have to close oobabooga completely before I switch to a new model, because of the memory-releasing problem. I also have a 3090, but I haven't found any benefit in using ngl for 13B models; in that case, GPTQ is nearly 3x faster. I do use ngl for 65B models: I can offload half of the model to the GPU and the speed is about 66% faster than purely on CPU, 1.66 t/s in my case (3090 Ti, Ryzen 5900X, 64 GB RAM, Linux). On Windows I got just 1 t/s with the same settings.

Wait... you have 1.6 t/s with half the layers on an RTX 3090?
That's actually slow.
I get 1.3 t/s purely on CPU (i9 9900, 64 GB RAM)...
But I'm using llama.cpp.

I am talking about the 65B model: oobabooga + llama.cpp with 40 layers offloaded to the GPU. After some tweaks I get 1.7 t/s. Are you saying it can be improved?

Sorry, that is my mistake... I thought you said 1.6 s/token, lol.
You actually said 1.6 t/s with an RTX 3090 😅.

I went ahead and installed CUDA 12.1, thinking that llama.cpp might need the latest CUDA to compile with cuBLAS, but even with that I'm getting OOM. And this is with the latest git pull as of now.

Starting llama.cpp with the following line:

./main -t 20 --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -m models/wizard-mega-13B.ggml.q5_0.bin -r "user " --interactive-first -ngl 40

This works in native Windows (~6 t/s) using the pre-compiled binary with cuBLAS support, but in WSL (which I prefer to run in) it doesn't, and gives OOM.
....
WARNING: failed to allocate 1602.00 MB of pinned memory: out of memory
llama_init_from_file: kv self size = 1600.00 MB
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
...
CUDA error 2 at ggml-cuda.cu:693: out of memory

So....

For anyone who is trying this in WSL, it appears Pinned Memory is limited, as per https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-applications

The solution for this is to add the environment variable GGML_CUDA_NO_PINNED=1, as per https://github.com/abetlen/llama-cpp-python/issues/229#issuecomment-1553800926

Actually, I realized I made a mistake...I started the command with 20 threads but only had 12 enabled in WSL, duh.

With -t 12 and -ngl 40 I'm finally getting a very usable speed: ~10 t/s.

EDIT:

Not sure if I'm doing this right (running a test), but for example, running:

GGML_CUDA_NO_PINNED=1 ./main -t 12 -ngl 40 --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -m models/wizard-mega-13B.ggml.q5_0.bin -n 2048 --mtest -p "### Instruction: write a story about llamas ### Response:"

I get this result back:

llama_print_timings: load time = 11020.03 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token)
llama_print_timings: prompt eval time = 4942.57 ms / 512 tokens ( 9.65 ms per token)
llama_print_timings: eval time = 136.34 ms / 1 runs ( 136.34 ms per token)
llama_print_timings: total time = 11156.38 ms

Is there a better way to get some measurements of performance?
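One hedged suggestion: judging by the "1 runs" in the output above, --mtest only performs a single eval, so a longer real generation without it should give a more representative per-token number, e.g. something like:

GGML_CUDA_NO_PINNED=1 ./main -t 12 -ngl 40 -m models/wizard-mega-13B.ggml.q5_0.bin -n 256 -p "### Instruction: write a story about llamas ### Response:"

Then read the "eval time" line from llama_print_timings, which is averaged over all generated tokens.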
