
My usual quants.

#1
by ZeroWw - opened

Discord: https://discord.com/channels/@robert_46007

These are my own quantizations (updated almost daily).

The difference from normal quantizations is that I quantize the output and embedding tensors to f16,
and the other tensors to q5_k, q6_k or q8_0.
This creates models with little or no degradation and a smaller size.
They run at about 3-6 t/s on CPU only using llama.cpp,
and obviously faster on computers with powerful GPUs.
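
For anyone who wants to try the same recipe, here is a minimal sketch of how the command might be assembled. It assumes llama.cpp's `llama-quantize` tool with its `--output-tensor-type` and `--token-embedding-type` options, and it assumes the body types are llama.cpp's q5_k, q6_k and q8_0. The function name and the file names are hypothetical, and this is not necessarily the exact command used for the quants in this thread.

```python
import shlex

# Body quantization types mentioned in the post (assumed llama.cpp names).
ALLOWED_BODY_TYPES = {"q5_k", "q6_k", "q8_0"}


def build_quantize_command(src, dst, body_type="q6_k", binary="llama-quantize"):
    """Build (but do not run) a llama-quantize command line.

    Output and token-embedding tensors are kept at f16; every other
    tensor is quantized to `body_type`.
    """
    if body_type not in ALLOWED_BODY_TYPES:
        raise ValueError(f"unsupported body type: {body_type!r}")
    return [
        binary,
        "--output-tensor-type", "f16",    # keep the output tensor at f16
        "--token-embedding-type", "f16",  # keep the embedding tensor at f16
        src,        # input GGUF (hypothetical path, e.g. an f16 export)
        dst,        # output GGUF (hypothetical path)
        body_type,  # quantization type for the remaining tensors
    ]


if __name__ == "__main__":
    cmd = build_quantize_command("model.f16.gguf", "model.q6_k.gguf")
    # Print a shell-ready string; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

The command is returned as a list so it can be inspected or logged before anything is run. Check `llama-quantize --help` on your build first, since option names can change between llama.cpp versions.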

Nothing is Real org

Thank you! I've added the link to the model card

AuriAetherwiing changed discussion status to closed
