Ivan Peshkov

Erilaz

AI & ML interests

None yet

Recent Activity

new activity 17 days ago

mradermacher/Vikhr-Nemo-dostoevsky-saiga-12b-i1-GGUF:calibration dataset language

new activity about 1 month ago

Sao10K/14B-Qwen2.5-Kunou-v1:Really Fun Model

new activity 5 months ago

black-forest-labs/FLUX.1-dev:🚩 Report: Ethical issue(s)

Organizations

None yet

Erilaz's activity

New activity in mradermacher/Vikhr-Nemo-dostoevsky-saiga-12b-i1-GGUF 17 days ago

calibration dataset language

#1 opened 17 days ago by

Erilaz

New activity in Sao10K/14B-Qwen2.5-Kunou-v1 about 1 month ago

Really Fun Model

#3 opened about 1 month ago by

isr431

New activity in black-forest-labs/FLUX.1-dev 5 months ago

🚩 Report: Ethical issue(s)

#56 opened 6 months ago by

WWHugFace

liked 2 Spaces 6 months ago

FLUX.1 [dev]

New activity in Qwen/Qwen2-57B-A14B-Instruct 8 months ago

What dose A14 means? Could we get the detail of Qwen MOE architechture?

#1 opened 8 months ago by

JohnSaxon

New activity in mistralai/Codestral-22B-v0.1 8 months ago

What is the context size on this model? And it does not appear to deal with JSON, function calling well.

#15 opened 8 months ago by

BigDeeper

New activity in failspy/Phi-3-mini-4k-geminified 8 months ago

The model card unclear xD

#1 opened 8 months ago by

Erilaz

New activity in bartowski/Phi-3-medium-128k-instruct-GGUF 8 months ago

4k versions load and work in Koboldcpp, but the 128k versions don't.

#1 opened 8 months ago by

YuriGagarine

New activity in mradermacher/model_requests 8 months ago

MAmmoTH2 8x7B and MAmmoTH2 8x7B Plus

#58 opened 8 months ago by

Erilaz

New activity in lightblue/suzume-llama-3-8B-multilingual 9 months ago

<|eot_id|> in aphrodite-engine

#2 opened 9 months ago by

med4u

replied to alielfilali01's post 10 months ago

The issue is that all those experts have to be very diverse and trained more or less simultaneously.
If you are going to use a sparse MoE, your router has to be able to predict the fittest expert for the upcoming token, which means the router has to be trained together with the experts. That wouldn't be an issue for a classic MoE, but both kinds of models also rely on the experts sharing a uniform "understanding" of the cached context. I don't think a 100x2B model would work well enough without that. That's the reason why fine-tuning Mixtral is such a complicated task.
On top of that, we don't really have a good base 2B model. Sure, Phi exists... with 2K context length, no GQA, coherency issues and very limited knowledge. I don't think the point of an "expert" is to provide domain-specific capabilities to the composite model; I think the trick is overcoming diminishing returns in training, plus some bandwidth optimizations for inference. So among your 100 experts, one might hold both an analog of a grandmother cell and some weights associated with division. Another expert could be good at both kinds of ERP: Enterprise Resource Planning and the main excuse for creating Frankenmerges, lol. Model distillation keeps getting better over time, but I don't think any modern 2B can help compete with GPT-4. Perhaps a 16x34B could, but good luck training that from scratch as a relatively small business, let alone a nonprofit or a private individual.
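
To make the routing argument concrete, here is a minimal sketch of top-k sparse routing for a single token, in plain Python. It is an illustration under assumptions, not how Mixtral or any specific model implements it: the function names (`route_top_k`, `sparse_moe_forward`), the toy scaling "experts" and the hand-picked router weights are all hypothetical. In a real model the router weights are learned jointly with the experts, which is exactly the coupling described above.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

def sparse_moe_forward(token, router_weights, experts, k=2):
    """Run one token through only the k experts the router selects."""
    # The router is a single linear layer: one score per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = route_top_k(logits, k)
    out = [0.0] * len(token)
    for idx, gate in chosen:
        expert_out = experts[idx](token)  # only k experts are evaluated
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out, [idx for idx, _ in chosen]

# Hypothetical toy setup: 4 "experts" that just scale their input.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
token = [0.5, 2.0]

output, used = sparse_moe_forward(token, router_weights, experts, k=2)
print("experts used:", used)
print("output:", output)
```

The output is a gate-weighted mix of whatever the selected experts produce, so if one expert is swapped or fine-tuned in isolation, the router's scores no longer match what that expert actually does.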

New activity in mradermacher/model_requests 10 months ago