Yunus Manti

Yuma42

AI & ML interests

I'm generating and running AI models to experiment and educate myself.

Organizations

None yet

Yuma42's activity

replied to bartowski's post 10 days ago

Now that the software I'm using has updated its llama.cpp version, I'm changing GGUFs. I don't get what's meant with IQ4_NL: does this include IQ4_XS? So is IQ4_XS also supposed to run performantly on ARM, or just Q4_0?

On a side note, since I had good performance with Q4_K_S in the past, I wish that it would also benefit from these changes.

reacted to bartowski's post with 👍 10 days ago
Looks like Q4_0_N_M file types are going away

Before you panic, there's a new "preferred" method which is online (I prefer the term on-the-fly) repacking, so if you download Q4_0 and your setup can benefit from repacking the weights into interleaved rows (what Q4_0_4_4 was doing), it will do that automatically and give you similar performance (minor losses I think due to using intrinsics instead of assembly, but intrinsics are more maintainable)
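
For intuition, here is a minimal C sketch of what "repacking into interleaved rows" means, assuming a simplified Q4_0 block layout (one fp16 scale plus 32 weights packed as 4-bit nibbles). The names block_q4_0 and block_q4_0x4 echo llama.cpp's, but the real code interleaves the nibbles at a finer granularity so SIMD lanes line up; treat this as illustrative only, not the actual implementation:

```c
#include <stdint.h>
#include <string.h>

#define QK4_0 32

/* Simplified Q4_0 block: a block of 32 weights stored as one
 * fp16 scale plus 16 bytes of packed 4-bit quants. */
typedef struct {
    uint16_t d;              /* fp16 scale for the block           */
    uint8_t  qs[QK4_0 / 2];  /* 32 weights packed as 4-bit nibbles */
} block_q4_0;

/* Interleaved super-block holding the same block index from 4 rows,
 * roughly what the Q4_0_4_4 layout provided on disk. */
typedef struct {
    uint16_t d[4];               /* scales of the 4 sibling blocks */
    uint8_t  qs[4 * QK4_0 / 2];  /* their quants, grouped together */
} block_q4_0x4;

/* Gather block b from 4 consecutive rows into one interleaved
 * super-block, so a SIMD kernel can multiply 4 rows against the
 * activation vector in a single pass. With on-the-fly repacking this
 * transform happens at load time instead of at conversion time. */
static void repack_4rows(const block_q4_0 *rows[4], int nblocks,
                         block_q4_0x4 *out)
{
    for (int b = 0; b < nblocks; ++b) {
        for (int r = 0; r < 4; ++r) {
            out[b].d[r] = rows[r][b].d;
            memcpy(&out[b].qs[r * (QK4_0 / 2)], rows[r][b].qs, QK4_0 / 2);
        }
    }
}
```

The upside of doing this at load time is that a single Q4_0 file works everywhere: machines whose kernels want interleaved rows repack once on load, and everyone else just uses the blocks as stored.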

You can see the reference PR here:

https://github.com/ggerganov/llama.cpp/pull/10446

So if you update your llama.cpp past that point, you won't be able to run Q4_0_4_4 (unless they add backwards compatibility back), but Q4_0 should be the same speeds (though it may currently be bugged on some platforms)

As such, I'll stop making those newer model formats soon, probably end of this week unless something changes, but you should be safe to download Q4_0 quants and use those!

Also, IQ4_NL supports repacking, though not in as many shapes yet, but it should get a respectable speedup on ARM chips. The PR for that can be found here: https://github.com/ggerganov/llama.cpp/pull/10541

Remember, these are not meant for Apple silicon since those use the GPU and don't benefit from the repacking of weights
New activity in open-llm-leaderboard/open_llm_leaderboard 26 days ago

Proposal for new column

#1032 opened 26 days ago by Yuma42
reacted to bartowski's post with 👍 2 months ago
In regards to the latest mistral model and GGUFs for it:

Yes, they may be subpar and may require changes to llama.cpp to support the interleaved sliding window

Yes, I got excited when a conversion worked and released them ASAP

That said, generation seems to work right now and seems to mimic the output from spaces that are running the original model

I have appended -TEST to the model names in an attempt to indicate that they are not final or perfect, but if people still feel misled and that it's not the right thing to do, please post (civilly) below your thoughts; I will strongly consider pulling the conversions if that's what people think is best. After all, that's what I'm here for, in service to you all!

Q4_0_4_4 gguf

#3 opened 2 months ago by Yuma42
New activity in rombodawg/Rombos-LLM-V2.5-Qwen-7b 3 months ago

Chat template

#1 opened 3 months ago by Yuma42