Good for summarization.

#1
by ThiloteE - opened

I like this model :-) Used it for summarization tasks (scientific papers). Not perfect, but the best I have found in the < 8b range so far (though I am only doing slow manual testing). The model speeds up my workflow a little. I strongly recommend reading the paper(s), then doing the summarization (with a good system prompt), and then fixing any errors that emerge. In science, we can't have hallucinations / confabulations.

I used the Llama-3.1-SuperNova-Lite-8.0B-OQ8_0.EF32.IQ4_K_M.gguf version.

If I may ask, how do you upscale the embedding layer? I have quantized models with llama.cpp before and I know about the --leave-output-tensor option, but I haven't seen anything for embedding layers...

Great Question! πŸ˜‹

Output taken from llama-quantize --help:
--output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
--token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor

For example (with Imatrix):

f32_model_path="/Users/jsarnecki/opt/Workspace/arcee-ai/Llama-3.1-SuperNova-Lite/Llama-3.1-SuperNova-Lite-8.0B-F32.gguf"
quantized_model_path="/Users/jsarnecki/opt/Workspace/arcee-ai/Llama-3.1-SuperNova-Lite/Llama-3.1-SuperNova-Lite-8.0B-OF32.EF32.IQ8_0.gguf"

./llama-quantize \
  --imatrix /Users/jsarnecki/opt/llama.cpp/build/bin/Llama-3.1-SuperNova-Lite-8.0B-F32.imatrix \
  --output-tensor-type F32 \
  --token-embedding-type F32 \
  "$f32_model_path" "$quantized_model_path" Q8_0

For example (Sans Imatrix):

./llama-quantize \
  --output-tensor-type F32 \
  --token-embedding-type F32 \
  /Users/jsarnecki/opt/Workspace/arcee-ai/Llama-3.1-SuperNova-Lite/Llama-3.1-SuperNova-Lite-8.0B-F32.gguf \
  /Users/jsarnecki/opt/Workspace/arcee-ai/Llama-3.1-SuperNova-Lite/Llama-3.1-SuperNova-Lite-8.0B-OF32.EF32.Q8_0.gguf \
  Q8_0
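If you want to double-check that the upscaling actually took, you can read the tensor types back out of the finished GGUF. A minimal sketch, assuming the gguf Python package from llama.cpp's gguf-py is installed and its gguf-dump helper is available: output.weight and token_embd.weight should report F32 while the rest report Q8_0.

pip install gguf   # gguf-py package that ships with llama.cpp
gguf-dump "$quantized_model_path" | grep -E "token_embd\.weight|output\.weight"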


I'm so glad that you like and enjoy the model. Please try out the latest version of Llama-3.1-SuperNova-Lite-8.0B-OQ8_0.EF32.IQ4_K_M.gguf.

Thank you so much for your helpful response. I will try out the quantization features.

I do like the model. In its size class it is definitely a top model, but to be honest, I experimented a lot more with summarization after I wrote those comments and found that the Q4_0 of https://huggingface.co/bartowski/SuperNova-Medius-GGUF, which is a finetune of Qwen/Qwen2.5-14B, fulfills my needs much better. My hypothesis is that the 8b model is not good enough to deal with my very long system prompt and instruction (I want the model to address a list of very specific "things" in the summaries, based on the 15-35k token input of the scientific paper). Llama-3.1-SuperNova-Lite-8.0B always had "holes" in the summaries, where it "forgot" to address some of the things I tasked it with, like a student or worker who didn't have enough time to finish the exam or the job. I also suspect that addressing everything in one response would exceed the length it was trained to respond with; models do have an average response length, after all. The things it did summarize were sometimes good, sometimes bad. The 14b model rarely has those "holes" and mostly addresses everything, sometimes better, sometimes not so good, but at least it addresses everything :D

There are many new models coming out and I wonder how a 28-32b model would behave, but I don't have the hardware for that. Even the 14b model is quite slow on my system, so I only use it for important things. Anyway, thanks!


Awesome! Yeah, SuperNova-Medius is probably better trained. Keep in mind that SuperNova-Lite was trained before SuperNova-Medius, so any mistakes Arcee-AI became aware of and lessons they learned were taken into full consideration when they trained SuperNova-Medius.

Here's the latest imatrix for SuperNova-Lite: Llama-3.1-SuperNova-Lite-8.0B-F32.imatrix. Please use it for best results.

Also: IBM recently released Granite-3.1-8B-Instruct, so look out for a new repo of O.E GGUF quants as well, coming as soon as the imatrix for the F32 of the model is finished being computed. It will probably be pushed to HuggingFace within the next 5 hours. 😋

https://huggingface.co/ibm-granite/granite-3.1-8b-instruct
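For anyone curious what that computation looks like: a minimal sketch using llama.cpp's llama-imatrix tool, assuming the Granite checkpoint has already been converted to an F32 GGUF and that calibration.txt stands in for whatever calibration text is used (both file names here are placeholders, not necessarily what this repo uses):

# Compute an importance matrix over calibration text (placeholder file names)
./llama-imatrix \
  -m granite-3.1-8B-instruct-F32.gguf \
  -f calibration.txt \
  -o granite-3.1-8B-instruct-F32.imatrix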

The F32 imatrix is done being computed, so if anyone wants to use it to make their own quantizations while I make mine, here it is: granite-3.1-8B-instruct-F32.imatrix
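If you want to roll your own quants from it before mine land, the recipe is the same as the SuperNova-Lite example above. A sketch, assuming the HF repo has been converted to an F32 GGUF first (the convert invocation and file names below are assumptions about a local setup, not this repo's exact commands):

# 1. Convert the HF checkpoint to an F32 GGUF (paths are placeholders)
python convert_hf_to_gguf.py ./granite-3.1-8b-instruct --outtype f32 --outfile granite-3.1-8B-instruct-F32.gguf

# 2. Quantize with the shared imatrix, keeping the output and embedding tensors at F32
./llama-quantize \
  --imatrix granite-3.1-8B-instruct-F32.imatrix \
  --output-tensor-type F32 \
  --token-embedding-type F32 \
  granite-3.1-8B-instruct-F32.gguf granite-3.1-8B-instruct-OF32.EF32.Q8_0.gguf Q8_0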
