METAL and FLASH ATTENTION: Why does the Non-Linear IQ4_NL Quant run faster than the IQ4_K_M Quants when I use larger Quants (see file size in GB) of LLaMa-3-8B-Instruct with no issues?

#2
by Joseph717171 - opened

@bartowski I have a conundrum. I typically run non-standard IQ6_K - IQ8_0 Quants of LLaMa-3-8B-Instruct, which are larger in file size than the non-standard Quants of Gemma-2-9B-IT that I am running. But, unlike with my Quants of LLaMa-3-8B-Instruct, when I try to run smaller non-standard Quants of Gemma-2-9B-IT, I get significantly lower tok/s generation as a result. Note: my non-standard Quants keep the output tensors and the embeddings in f32 instead of bf16, f16, etc. For comparison, my non-standard Quants of LLaMa-3-8B-Instruct @Q6_K are 9.94 GB, and my non-standard Quants of Gemma-2-9B-IT @IQ4_K_M are 8.69 GB in file size.

My Convert to GGUF setup:

python convert-hf-to-gguf.py $model --outtype f32

My llama.cpp Quant setup:

For LLaMa-3-8B-Instruct

./llama-quantize --imatrix "$imatrix" --leave-output-tensor --allow-requantize --token-embedding-type f32 "$f32_model" "$model"/"$model_name"-F32-IQ6_K.gguf Q6_K 16
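
As a sanity check, the gguf-dump script that ships in llama.cpp's gguf-py/scripts folder can confirm that --token-embedding-type f32 and --leave-output-tensor actually took effect; the script's exact name and location vary between llama.cpp versions, so treat the path below as a sketch:

python ./gguf-py/scripts/gguf-dump.py "$model"/"$model_name"-F32-IQ6_K.gguf | grep -E "token_embd\.weight|output\.weight"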

For Gemma-2-9B-IT

./llama-quantize --imatrix "$imatrix" --leave-output-tensor --allow-requantize --token-embedding-type f32 "$f32_model" "$model"/"$model_name"-F32-IQ4_K.gguf Q4_K_M 16

For my use-case, I run my models in LM Studio with -fa (Flash Attention) enabled. However, for testing purposes, I also ran my Quants of Gemma-2-9B-IT on the latest version of llama.cpp to see if the issue was with LM Studio, but it wasn't.
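
Something like the following is enough to reproduce the comparison outside LM Studio, assuming a recent llama.cpp build where the CLI binary is called llama-cli and flash attention is toggled with -fa (the timing summary printed at the end of each run reports the tok/s):

./llama-cli -m "$model"/"$model_name"-F32-IQ4_K.gguf -p "Write a short story about a robot." -n 256 -fa
./llama-cli -m "$model"/"$model_name"-F32-IQ4_K.gguf -p "Write a short story about a robot." -n 256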

Running LM Studio with flash attention enabled works perfectly fine with LLaMa-3-8B-Instruct and other models, giving me better memory usage, faster inference, and more tok/s, despite the fact that these Quants are larger than my Gemma-2-9B-IT Quants.

However, things change when I attempt to run my Quants of Gemma-2-9B-IT with flash attention enabled. When I do, token/s generation immediately drops to a deathly crawl. Through experimentation, I have found that if I turn off flash attention, my IQ4_K_M Quants of Gemma-2-9B-IT run comparably to my Quants of LLaMa-3-8B-Instruct, proportional to their parameter size.

This drop in performance, and the need to turn off flash attention, puzzled me. I hypothesized: if the IQ4_K_M Quants were slower with flash attention enabled, then perhaps Non-Linear IQ4_NL Quants would be faster with flash attention enabled. I tried it, and they were. My non-standard IQ4_NL Quant of Gemma-2-9B-IT ran comparably to LLaMa-3-8B-Instruct, proportional to its parameter size. For reference, my non-standard IQ4_NL Quant of Gemma-2-9B-IT is 8.36 GB in file size, which is 0.33 GB smaller than my IQ4_K_M Quant of it; however, as I said earlier, I have no problems running larger Quants of LLaMa-3-8B-Instruct and other LLMs with flash attention enabled.

Non-Linear Quant of Gemma-2-9B-IT (for reference)

./llama-quantize --imatrix "$imatrix" --leave-output-tensor --allow-requantize --token-embedding-type f32 "$f32_model" "$model"/"$model_name"-F32-IQ4_NL.gguf IQ4_NL 16
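
To compare the two Gemma-2-9B-IT files above head-to-head, llama-bench can sweep both models and both flash-attention settings in a single invocation; the flag names below reflect recent llama.cpp builds, so check ./llama-bench --help on your version:

./llama-bench -m "$model"/"$model_name"-F32-IQ4_K.gguf,"$model"/"$model_name"-F32-IQ4_NL.gguf -fa 0,1 -p 512 -n 128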

This is my experience with -fa (Flash Attention) enabled when running Gemma-2-9B-IT. I'm curious whether anyone else is seeing something similar. This is weird. 🤔

#METAL #FLASH_ATTENTION #IQ4_K_M_SLOWER_THAN_IQ4_NL


This puzzles me. I'll need to look into this more, because you're right, this doesn't really seem to make any sense. Why would flash attention ever slow things down? I don't use Metal too much, but I'll maybe try to spin it up on my MacBook Air anyway and see if I can recreate what you're seeing.

https://github.com/ggerganov/llama.cpp/pull/8197

Per this llama.cpp PR, it looks like Gemma-2 isn't even supposed to use flash attention (as I understand it, Gemma-2's attention logit soft-capping wasn't compatible with llama.cpp's flash-attention kernels at the time). If that is the case, the significant drop in tok/s generation I've been seeing makes a lot more sense. 🤔
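
One quick way to check what a given build is actually doing is to load the Gemma-2 Quant with -fa and grep the startup log for flash-attention / soft-cap messages; the exact wording (and whether a warning is printed at all) differs between llama.cpp versions:

./llama-cli -m "$model"/"$model_name"-F32-IQ4_K.gguf -fa -p "test" -n 16 2>&1 | grep -i -E "flash_attn|soft_cap|softcap"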

Joseph717171 changed discussion status to closed
