phixtral-3x2_8
I tried 3 experts (dropped the coder expert) with num_experts_per_tok set to 2:
AGI_EVAL: 33.93
GPT4ALL: 70.50
TruthfulQA: 48.78
BigBench: 37.84
Average: 47.76
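For anyone who wants to check the arithmetic, here is a quick sketch. It assumes the reported average is the unweighted mean of the four suite scores, which the numbers above are consistent with.

```python
# Scores copied from the run above; the unweighted mean is an assumption
# about how the "Average" line was computed.
scores = {
    "AGI_EVAL": 33.93,
    "GPT4ALL": 70.50,
    "TruthfulQA": 48.78,
    "BigBench": 37.84,
}

average = sum(scores.values()) / len(scores)
print(f"Average: {average:.2f}")
```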
So the fact that we can't really improve beyond the smallest phixtral-2x2_8 indicates that the coder expert is not to blame. This is just by design: all the experts share the same phi-2 base and are only fine-tunes of it, so they add little diversity. If we really want to improve on this work, we would need additional phi-2 base models pretrained with different seeds. However, that might not work with mergekit, which retains only one given base model for the attention layers.
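To illustrate the intuition, here is a toy sketch of top-k routing over experts that are all small perturbations of one shared matrix. It is not the phixtral or mergekit code: `top_k_moe`, the dimensions, and the 0.01 perturbation scale are all hypothetical, and each expert is reduced to a single linear map.

```python
import numpy as np

def top_k_moe(x, gate_w, experts, k=2):
    """Route one token vector through the k highest-scoring experts."""
    logits = gate_w @ x                       # one router score per expert
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over the selected experts only
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
dim, n_experts = 8, 3

# Stand-in for the shared base weights (hypothetical toy matrix).
base = rng.normal(size=(dim, dim))
# Each "expert" is the base plus a small fine-tuning delta (assumed scale).
experts = [base + 0.01 * rng.normal(size=(dim, dim)) for _ in range(n_experts)]
gate_w = rng.normal(size=(n_experts, dim))
x = rng.normal(size=dim)

moe_out = top_k_moe(x, gate_w, experts, k=2)
base_out = base @ x
rel_diff = np.linalg.norm(moe_out - base_out) / np.linalg.norm(base_out)
print(f"relative difference from base output: {rel_diff:.4f}")
```

Because the routing weights sum to 1, the mixture output is the base output plus a weighted average of the deltas. With identical experts it collapses exactly to the base output, and with small deltas it tends to stay close to it, whichever experts the router picks.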
If someone wants to try DPO on phixtral-4x2_8, we might get a pleasant surprise, but I doubt it. Let me know if you're interested and I'll package the additional code for that.
Interesting results, thanks!