So it turns out I've been spreading a bit of misinformation when it comes to imatrix in llama.cpp.

It starts out true: imatrix runs the model against a corpus of text and tracks the activations feeding into the weights to determine which weights are most important.
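To make the "tracking" step a bit more concrete, here is a rough sketch of how an importance vector could be accumulated. It assumes the importance of each input column is the mean squared activation seen over the calibration text, which is my understanding of the idea rather than a copy of llama.cpp's code. `collect_importance` and `activation_batches` are hypothetical names used only for illustration.

```python
import numpy as np

def collect_importance(activation_batches):
    """Return the mean squared activation per input column.

    activation_batches: iterable of arrays shaped (n_tokens, n_columns),
    one per chunk of calibration text (hypothetical input format).
    """
    total, count = None, 0
    for x in activation_batches:
        x = np.asarray(x, dtype=np.float64)
        sq = (x ** 2).sum(axis=0)            # sum of squares per column
        total = sq if total is None else total + sq
        count += x.shape[0]                  # number of tokens seen so far
    if total is None:
        raise ValueError("no activations were provided")
    return total / count

batches = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([[0.0, 2.0]])]
importance = collect_importance(batches)
```

Columns that consistently see large activations end up with a large importance value, so errors in the weights multiplied by those columns matter more.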

However, what the quantization then does with that information is where I was wrong.

I think I accidentally conflated imatrix with ExLlamaV2's measurement pass, where ExLlamaV2 decides how many bits to assign to which weights depending on the target BPW (bits per weight).

Instead, what llama.cpp does with imatrix is try to select a scale for each quantization block that most accurately returns the important weights to their original values, i.e., it minimizes the dequantization error weighted by the importance of the activations.

The mildly surprising part is that it actually just does a relatively brute-force search: it picks a bunch of candidate scales, tries each one, and keeps whichever results in the minimum error for the weights deemed important in the block.
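A minimal sketch of that kind of search, assuming a simple symmetric 4-bit block with a single scale. This is not llama.cpp's actual code: the function names, the candidate range (0.7x to 1.3x of the naive scale), and the number of candidates are all made up for illustration.

```python
import numpy as np

def weighted_error(weights, scale, q, importance):
    """Importance-weighted squared dequantization error for one block."""
    return float(np.sum(importance * (weights - scale * q) ** 2))

def quantize_block(weights, importance, bits=4, n_candidates=32):
    """Try several scales and keep the one with the lowest weighted error."""
    weights = np.asarray(weights, dtype=np.float64)
    qmax = 2 ** (bits - 1) - 1               # 7 for 4 bits
    qmin = -qmax - 1                         # -8 for 4 bits
    amax = np.max(np.abs(weights))
    if amax == 0:
        return 0.0, np.zeros(weights.shape, dtype=np.int8)
    best_scale, best_q, best_err = None, None, np.inf
    # Candidate scales spread around the naive choice amax / qmax.
    for factor in np.linspace(0.7, 1.3, n_candidates):
        scale = factor * amax / qmax
        q = np.clip(np.round(weights / scale), qmin, qmax)
        err = weighted_error(weights, scale, q, importance)
        if err < best_err:
            best_scale, best_q, best_err = scale, q.astype(np.int8), err
    return best_scale, best_q

rng = np.random.default_rng(0)
w = rng.normal(size=32)
imp = np.ones(32)
imp[:4] = 100.0                              # pretend the first 4 weights matter most
scale_imp, q_imp = quantize_block(w, imp)
scale_uni, q_uni = quantize_block(w, np.ones(32))
```

Both calls store the same number of bits per weight; only the chosen scale can differ. Measured with `imp`, the error of the importance-aware result is never worse than the error of the uniform one, because both searches pick from the same set of candidates.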

But yeah, it turns out the quantization scheme is always the same; it's just that the scale selection has a bit more logic to it when you use imatrix.

Huge shoutout to @compilade for helping me wrap my head around it - feel free to add to or correct this as well if I've messed something up.

Reacted to as-cle-bert's post with 🚀 4 months ago

Hi HF Community!🤗

In the past few days, OpenAI announced their search engine, SearchGPT. Today, I'm glad to introduce you to SearchPhi, an AI-powered, open-source web search tool that aims to reproduce features similar to SearchGPT's, built upon microsoft/Phi-3-mini-4k-instruct, llama.cpp🦙 and Streamlit.
Although not as capable as SearchGPT, SearchPhi v0.0-beta.0 is a first step toward a fully functional and multimodal search engine :)
If you want to know more, head over to the GitHub repository (https://github.com/AstraBert/SearchPhi) and, to test it out, use this HF space: as-cle-bert/SearchPhi
Have fun!🐱

Reacted to victor's post with ❤️ 5 months ago

Hi @jonoirwin ! Big fan of https://fastvoiceagent.cerebrium.ai/ 🔥
I'd be super happy to give you a GPU grant to host it on a Space, it would allow more people to discover and use it!

Reacted to KingNish's post with 🔥➕ 6 months ago

Microsoft Just Launched 3 Powerful Models

1. Phi 3 Medium (4k and 128k): 14B instruction-tuned models that outperform big models like Command R+ (104b), GPT 3.5 Pro, and Gemini Pro, and are highly competitive with top models such as Mixtral 8x22b, Llama3 70B, and GPT 4.
microsoft/Phi-3-medium-4k-instruct
DEMO: https://huggingface.co/spaces/Walmart-the-bag/Phi-3-Medium

2. Phi 3 Mini Vision 128k: A 4.5 billion-parameter, instruction-tuned vision model that has outperformed models such as Llava3 and Claude 3, and is providing stiff competition to Gemini 1Pro Vision.
microsoft/Phi-3-vision-128k-instruct

3. Phi3 Small (8k and 128k): Better than Llama3 8b, Mixtral 8x7b and GPT 3.5 turbo.
microsoft/Phi-3-small-128k-instruct