Wow
Great result - the effect with "max-cpu-IQ4_XS" is amazing. But... that small context disappears before your eyes so quickly. I tried extending the context with RoPE - it works fine with a scale of 0.85 and a frequency base of 1000000 at 16k - but it runs poorly on a 16GB card. The alternative 4x7 version also does quite well, but those single 7b models "go wrong" more often.
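For reference, a minimal sketch of how those RoPE values could be passed when loading a GGUF with llama-cpp-python - assuming the 0.85 figure maps to rope_freq_scale and 1000000 to rope_freq_base; the model path and prompt are placeholders, not the actual files discussed here:

```python
# Minimal sketch: extending context to 16k via RoPE overrides in llama-cpp-python.
# Assumptions: 0.85 is the RoPE frequency *scale* and 1000000 the frequency *base*;
# "model-IQ4_XS.gguf" is a placeholder for whichever quant you are running.
from llama_cpp import Llama

llm = Llama(
    model_path="model-IQ4_XS.gguf",   # placeholder GGUF file
    n_ctx=16384,                      # extended context window
    rope_freq_scale=0.85,             # assumed mapping of the "0.85" setting
    rope_freq_base=1000000.0,         # assumed mapping of the "1000000" setting
    n_gpu_layers=-1,                  # offload as many layers as fit on the 16GB card
)

out = llm("Write a short scene in a tavern.", max_tokens=256)
print(out["choices"][0]["text"])
```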
I would like to ask... What if you made a model based on the same principle as this one, but as a 3x8 with llama 3.1 (yes, I know it is weaker than llama 3)? I've been experimenting with llama 3.1 models, and these would be my suggestions - in my opinion, the smartest ones for RP:
- https://huggingface.co/v000000/L3.1-Niitorm-8B-DPO-t0.0001 (a really good one)
- https://huggingface.co/DavidAU/L3.1-RP-Hero-BigTalker-8B-GGUF (yes, yours)
- https://huggingface.co/Sao10K/Llama-3.1-8B-Stheno-v3.4 (It may not be the best on its own, but it has interesting prose.)
Why 3x8b? So that you can run higher quants on 16GB VRAM cards with decent context.
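For a rough sense of why, here is a back-of-the-envelope size estimate. All parameter counts and bits-per-weight below are approximate assumptions (a mergekit-style MoE reuses the attention/embedding weights and only clones the MLP experts), not measured figures:

```python
# Back-of-the-envelope GGUF size estimate: params * bits_per_weight / 8.
# All figures are rough assumptions, not measurements.
SHARED_B = 2.4    # assumed shared params (embeddings + attention) in billions
MLP_B = 5.6       # assumed MLP params per 8B expert, in billions
BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.70, "Q6_K": 6.60}  # approx. bits per weight

def moe_size_gb(n_experts: int, quant: str) -> float:
    params_b = SHARED_B + n_experts * MLP_B
    return params_b * BPW[quant] / 8  # decimal GB

for n in (2, 3, 4):
    for q in BPW:
        print(f"{n}x8B {q}: ~{moe_size_gb(n, q):.1f} GB")
```

Under these assumptions a 3x8 still fits mid/high quants in 16GB with room for context, while a 4x8 at similar quants leaves almost nothing spare.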
Thanks for your work, and I wish you continued success.
Thank you!
I have added your comments / recommendations to the list.
Likely this will be a 4X8 or 2X8, as there are some known issues with 3X (or any odd number of models) in MOE-configured models.
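For illustration only, a minimal sketch of what a 2X8 mergekit-moe config along those lines could look like - the expert choices and positive prompts are placeholders taken from the links above, not an actual recipe:

```python
# Minimal sketch of a 2x8 mergekit-moe config, written out from Python.
# Expert choices and prompts are illustrative placeholders only.
import yaml

config = {
    "base_model": "v000000/L3.1-Niitorm-8B-DPO-t0.0001",
    "gate_mode": "hidden",   # route on hidden-state similarity to the positive prompts
    "dtype": "bfloat16",
    "experts": [
        {
            "source_model": "v000000/L3.1-Niitorm-8B-DPO-t0.0001",
            "positive_prompts": ["roleplay", "stay in character"],
        },
        {
            "source_model": "Sao10K/Llama-3.1-8B-Stheno-v3.4",
            "positive_prompts": ["descriptive prose", "narration"],
        },
    ],
}

with open("moe-2x8.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)

# Then (assuming mergekit is installed): mergekit-moe moe-2x8.yaml ./output-model
```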
NOTE: I will not be making any more models until Feb at the soonest, as I am creating my own "samplers" / "modules" to control generation in real time, at the actual moment of generation (i.e. the streaming response) itself...
Sounds impressive... Fingers crossed, good luck.