Feedback
Awesome work @xxx777xxxASD ! After testing a Q6_K quant for a bit, I think it rivals some of the 70Bs I've used in terms of creative storytelling and embodying the role of a character. Sometimes I found myself smiling at how good the outputs were for a model of this size.
However, it has a tendency to hallucinate body parts of characters (e.g. a female having male organs). Sometimes it even says things that probably belong in r/BrandNewSentence (e.g. bulge in a sports bra). These were dealbreakers for me, but others may be fine with editing the mistakes out and just pressing on. This model seems to become more coherent the more context it has to work with, and the hallucinations should lessen as the chat gets longer.
That said, I think 4x8B models have a lot of potential, and I can see future versions becoming much faster alternatives to 70B dense models, which are harder to run and inaccessible to most.
I wonder if this may be caused by the gate mode.
There are three methods implemented for populating the MoE gates:
- "hidden"
Uses the hidden state representations of the positive/negative prompts for the MoE gate parameters. Best quality and most effective option; the default. Requires evaluating each prompt using the base model, so you might not be able to use this on constrained hardware (depending on the model). You can use --load-in-8bit or --load-in-4bit to reduce VRAM usage.
- "cheap_embed"
Uses only the raw token embedding of the prompts, using the same gate parameters for every layer. Distinctly less effective than "hidden". Can be run on much, much lower end hardware.
- "random"
Randomly initializes the MoE gates. Good if you are going to fine-tune the model afterwards, or maybe if you want something a little unhinged? I won't judge.
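For reference, the gate mode is chosen in the mergekit-moe YAML config. Here's a minimal sketch of what that looks like; the base model path, expert paths, and prompts are placeholders (and trimmed to two experts for brevity), not the actual recipe used for this merge:

```yaml
# Hypothetical mergekit-moe config sketch; paths and prompts are placeholders.
base_model: meta-llama/Meta-Llama-3-8B-Instruct
gate_mode: hidden        # "hidden" (default), "cheap_embed", or "random"
dtype: bfloat16
experts:
  - source_model: ./expert-roleplay-8b
    positive_prompts:
      - "roleplay as a character"
      - "creative storytelling"
  - source_model: ./expert-general-8b
    positive_prompts:
      - "answer the question"
      - "general conversation"
```

With `gate_mode: hidden`, each prompt gets evaluated through the base model to derive the router weights, which is where the `--load-in-8bit` / `--load-in-4bit` flags mentioned above come in on constrained hardware, e.g. something like `mergekit-moe config.yml ./output-model --load-in-4bit`.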