Wow

#2
by Danioken - opened

Great result, the effect with "max-cpu-IQ4_XS" is amazing. But... this small context disappears before our eyes so quickly. I tried extending the context with RoPE - it works fine with base 0.85 and frequency 1000000 at 16k - but it runs poorly on a 16GB card. The alternative 4x7 version also does quite well, but those single 7B models "go wrong" more often.
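In case it helps anyone, this is roughly what that RoPE override looks like with llama-cpp-python - the GGUF file name and the exact values are just placeholders from my own tinkering, not a recommendation:

```python
# Rough sketch of loading a GGUF model with a RoPE override to stretch the
# context window, via llama-cpp-python. File name and base/scale values are
# placeholders -- tune them for your own model and card.
from llama_cpp import Llama

llm = Llama(
    model_path="model-IQ4_XS.gguf",   # hypothetical file name
    n_ctx=16384,                      # target context length (16k)
    rope_freq_base=1000000.0,         # RoPE frequency base mentioned above
    rope_freq_scale=0.85,             # RoPE scaling factor mentioned above
    n_gpu_layers=-1,                  # offload as many layers as fit on the GPU
)

out = llm("Write a short scene:", max_tokens=256)
print(out["choices"][0]["text"])
```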
I would like to ask... what if you made a model based on the same principle as this one, but 3x8 with Llama 3.1 (yes, I know it is weaker than Llama 3)? I've been experimenting with Llama 3.1 models, and these would be my suggestions - in my opinion the smartest ones for RP:

Why 3x8B? So that you can run higher quants on 16GB VRAM cards with decent context.
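Rough math behind that, using my own assumptions (approximate Llama-3 8B dimensions, experts sharing attention/embeddings so only the FFN blocks are duplicated):

```python
# Back-of-envelope VRAM estimate for a 3x8B MoE where experts share attention
# and embedding weights and only the FFN blocks are duplicated.
# All figures are approximate, not measured values.
hidden, intermediate, layers = 4096, 14336, 32
total_8b = 8.03e9                                  # params in a single 8B model

ffn_params = 3 * hidden * intermediate * layers    # gate/up/down projections
shared_params = total_8b - ffn_params              # attention, embeddings, norms

n_experts = 3
moe_params = shared_params + n_experts * ffn_params

bits_per_weight = 4.25                             # roughly IQ4_XS
size_gb = moe_params * bits_per_weight / 8 / 1e9
print(f"~{moe_params/1e9:.1f}B params, ~{size_gb:.1f} GB quantized")
# ~19.3B params, ~10.3 GB -- leaves a few GB for KV cache on a 16GB card.
```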

Thanks for your work, and I wish you continued success.

Thank you!
I have added your comments / recommendations to the list.
This will likely be a 4x8 or 2x8, as there are some known issues with 3x (or odd numbers of models) in MoE-configured models.

NOTE: I will not be making any more models until Feb at the soonest, as I am creating my own "samplers" / "modules" to control generation in real time at the actual moment of generation (i.e., the streaming response) itself...
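For anyone curious what "controlling generation at generation time" can look like in general, here is a generic sketch using the transformers LogitsProcessor hook - purely illustrative, not the actual samplers/modules being described above:

```python
# Generic illustration of steering generation token-by-token with a custom
# LogitsProcessor in Hugging Face transformers. NOT the author's sampler/module
# system -- just a minimal example of the hook point involved.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class RampTemperature(LogitsProcessor):
    """Gradually raise the sampling temperature as the response gets longer."""
    def __init__(self, prompt_len, start=0.7, end=1.1, ramp_tokens=200):
        self.prompt_len, self.start, self.end, self.ramp = prompt_len, start, end, ramp_tokens

    def __call__(self, input_ids, scores):
        generated = input_ids.shape[-1] - self.prompt_len
        t = self.start + (self.end - self.start) * min(generated / self.ramp, 1.0)
        return scores / t

name = "meta-llama/Meta-Llama-3.1-8B-Instruct"   # placeholder model choice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16,
                                             device_map="auto")

inputs = tok("Once upon a time", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=True,
                     logits_processor=LogitsProcessorList(
                         [RampTemperature(inputs["input_ids"].shape[-1])]))
print(tok.decode(out[0], skip_special_tokens=True))
```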

Sounds impressive... Fingers crossed, good luck.
