Ivan Peshkov
Erilaz's activity
calibration dataset language
Really Fun Model
🚩 Report: Ethical issue(s)
FLUX.1 [dev]
What does A14 mean? Could we get the details of the Qwen MoE architecture?
The model card unclear xD
4k versions load and work in Koboldcpp, but the 128k versions don't.
MAmmoTH2 8x7B and MAmmoTH2 8x7B Plus
<|eot_id|> in aphrodite-engine
The issue is that all those experts have to be very diverse and trained more or less simultaneously.
Because if you're going to use a sparse MoE, the router has to be able to predict the fittest expert for each upcoming token, which means the router has to be trained together with the experts. That wouldn't be an issue for a classic MoE, but both kinds of models also rely on the experts sharing a uniform "understanding" of the cached context. I don't think a 100x2B model would work well enough without that. That's the reason why fine-tuning Mixtral is such a complicated task.
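To make the router point concrete, here is a minimal sketch of top-k routing for a single token in plain Python. It assumes a linear router and a softmax over only the selected logits, which is one common variant and not necessarily what any particular model does. All names (`route_token`, `moe_layer`, the toy experts) are hypothetical and only for illustration.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a small list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(hidden, router_weights, top_k=2):
    """Return [(expert_index, gate_weight), ...] for one token."""
    # One logit per expert: dot product of the token's hidden state
    # with that expert's row of the (assumed linear) router.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in router_weights]
    # Keep the top_k highest-scoring experts, best first.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Normalize only over the chosen experts so the gates sum to 1.
    gates = softmax([logits[i] for i in chosen])
    return list(zip(chosen, gates))


def moe_layer(hidden, router_weights, experts, top_k=2):
    """Gate-weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(hidden)
    for idx, gate in route_token(hidden, router_weights, top_k):
        expert_out = experts[idx](hidden)  # the other experts never run
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out


# Toy usage: 8 "experts" that just scale the input by different factors.
random.seed(0)
dim, n_experts = 4, 8
router_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [lambda h, s=i + 1: [s * x for x in h] for i in range(n_experts)]
token = [0.5, -1.0, 0.25, 2.0]
routing = route_token(token, router_weights)
output = moe_layer(token, router_weights, experts)
```

The router rows only make sense relative to what each expert has learned, so swapping in independently trained experts would leave `route_token` scoring against weights that no longer match them. That is the coupling the comment is pointing at.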
On top of that, we don't really have a good base 2B model. Sure, Phi exists... with 2K context length, no GQA, coherency issues and very limited knowledge. I don't think the point of an "expert" is to provide domain-specific capabilities to the composite model; I think the trick is overcoming the diminishing returns in training, plus some bandwidth optimizations for inference. So among your 100 experts, one might hold both an analog of a grandmother cell and some weights associated with division. Another expert could be good at both kinds of ERP: Enterprise Resource Planning and the main excuse for creating Frankenmerges, lol. Model distillation keeps getting better over time, but I don't think any modern 2B can help to compete with GPT-4. Perhaps a 16x34B could, but good luck training that from scratch as a relatively small business, let alone a nonprofit or a private individual.
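The bandwidth point can be illustrated with rough arithmetic. This is a naive sketch that takes the "NxM" name literally (N experts of M parameters each) and ignores any layers shared between experts, so the numbers are upper bounds. The top-2 routing is an assumption, and `expert_weight_budget` is a hypothetical helper.

```python
def expert_weight_budget(n_experts, params_per_expert, top_k):
    """Return (total, active) expert parameters under the naive NxM reading."""
    total = n_experts * params_per_expert   # must all sit in memory
    active = top_k * params_per_expert      # read per token with sparse routing
    return total, active


# Hypothetical 100x2B model with assumed top-2 routing.
total_100x2b, active_100x2b = expert_weight_budget(100, 2_000_000_000, top_k=2)

# Hypothetical 16x34B model with the same assumed routing.
total_16x34b, active_16x34b = expert_weight_budget(16, 34_000_000_000, top_k=2)
```

Under these assumptions the 100x2B model stores about 200B parameters but touches only about 4B per token, while the 16x34B stores about 544B and touches about 68B per token.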