Joseph717171 committed
Commit • 52a22a3
1 Parent(s): 7add0c1
Update README.md
README.md CHANGED
@@ -1,6 +1,3 @@
 Custom GGUF quants of arcee-ai's [Llama-3.1-SuperNova-Lite-8B](https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite), where the Output Tensors are quantized to Q8_0 while the Embeddings are kept at F32. Enjoy!
 
-
-The original OQ8_0.EF32.IQuants will remain in the repo for those who want to use them. Cheers!
-
-Addendum: The OQ8_0.EF32.IQuants are the right size for the model; I was just being naive: I was comparing my OQ8_0.EF32 IQuants of Llama-3.1-SuperNova-Lite-8B against my OQ8_0.EF32 IQuants of Hermes-3-Llama-3.1-8B, thinking they were both the same size as my OQ8_0.EF32.IQuants of Llama-3.1-8B-Instruct; they're not: Hermes-3-Llama-3.1-8B is bigger. So now we have both OQ8_0.EF32.IQuants and OF32.EF32.IQuants, and they're both great quant schemes. The only difference, of course, is that the OF32.EF32.IQuants have even more accuracy at the expense of more vRAM. So, there you have it. I'm a dumbass, but it's okay because I learned something, and now we have even more quantizations to play with. Cheers!
+UPDATE: This repo now contains updated O.E.IQuants, which were quantized using a new F32 imatrix with llama.cpp version 4067 (54ef9cfc). That version of llama.cpp made all KQ mat_mul computations run in F32 instead of BF16 when using FA (Flash Attention). This change, on top of the earlier, very impactful change that made all KQ mat_muls compute in F32 (float32) precision on CUDA-enabled devices, has compounded to enhance the O.E.IQuants and made it excitingly necessary to push this update. Cheers!
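
For readers who want to reproduce this style of quant locally, here is a minimal sketch (not the author's exact recipe) of how an "Output Q8_0 / Embeddings F32" GGUF can be produced with llama.cpp's llama-quantize, which supports per-tensor overrides for the output tensor and the token embeddings. The file names, the imatrix path, and the IQ4_XS base type are illustrative assumptions.

```python
# Hedged sketch (assumed paths and quant type, not the author's exact commands):
# build an "OQ8_0.EF32" style IQuant with llama.cpp's llama-quantize.
import subprocess

subprocess.run(
    [
        "llama-quantize",
        "--imatrix", "imatrix.dat",              # importance matrix (assumed path)
        "--output-tensor-type", "q8_0",          # Output Tensors -> Q8_0
        "--token-embedding-type", "f32",         # Embeddings -> F32
        "Llama-3.1-SuperNova-Lite-8B-F32.gguf",  # assumed full-precision input GGUF
        "Llama-3.1-SuperNova-Lite-8B-OQ8_0.EF32-IQ4_XS.gguf",  # assumed output name
        "IQ4_XS",                                # assumed base quant type for the remaining tensors
    ],
    check=True,
)
```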
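The UPDATE above also mentions the new F32 imatrix and llama.cpp's Flash Attention KQ-precision change. As a rough illustration of the surrounding workflow (again a hedged sketch with assumed file names and a hypothetical calibration corpus), the imatrix can be computed with llama-imatrix against the F32 GGUF, and the finished quant run with Flash Attention enabled via llama-cli's -fa flag.

```python
# Hedged sketch: compute an imatrix from the F32 GGUF, then run the finished
# quant with Flash Attention. File names and the calibration text are
# assumptions for illustration only.
import subprocess

# 1) Importance matrix from the full-precision model.
subprocess.run(
    [
        "llama-imatrix",
        "-m", "Llama-3.1-SuperNova-Lite-8B-F32.gguf",  # assumed F32 GGUF
        "-f", "calibration.txt",                       # hypothetical calibration corpus
        "-o", "imatrix.dat",
    ],
    check=True,
)

# 2) Inference with Flash Attention enabled, where llama.cpp 4067+ computes
#    the KQ mat_muls in F32 as described in the UPDATE.
subprocess.run(
    [
        "llama-cli",
        "-m", "Llama-3.1-SuperNova-Lite-8B-OQ8_0.EF32-IQ4_XS.gguf",  # assumed quant
        "-fa",               # Flash Attention
        "-p", "Hello",       # example prompt
    ],
    check=True,
)
```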