
Ivan Peshkov

Erilaz

AI & ML interests

None yet

Organizations

None yet

Erilaz's activity

New activity in Sao10K/14B-Qwen2.5-Kunou-v1 about 1 month ago

Really Fun Model

1
#3 opened about 1 month ago by isr431
New activity in black-forest-labs/FLUX.1-dev 5 months ago

🚩 Report: Ethical issue(s)

1
#56 opened 6 months ago by WWHugFace
New activity in failspy/Phi-3-mini-4k-geminified 8 months ago

The model card is unclear xD

2
#1 opened 8 months ago by Erilaz
New activity in mradermacher/model_requests 8 months ago
New activity in lightblue/suzume-llama-3-8B-multilingual 9 months ago

<|eot_id|> in aphrodite-engine

1
#2 opened 9 months ago by med4u
replied to alielfilali01's post 10 months ago

The issue is that all those experts have to be very diverse and trained more or less simultaneously. If you are going to use sparse MoE, your router has to be able to predict the best-fitting expert for the upcoming token, which means the router has to be trained together with the experts. That wouldn't be an issue for a classic MoE, but both kinds of models also rely on the experts having a uniform "understanding" of the cached context, and I don't think a 100x2B model would work well enough without that. That's why fine-tuning Mixtral is such a complicated task.
Not only that, we don't really have a good base 2B model. Sure, Phi exists... with a 2K context length, no GQA, coherency issues, and very limited knowledge. I don't think the point of an "expert" is to provide domain-specific capabilities to the composite model; I think the trick is overcoming diminishing returns in training, along with some bandwidth optimizations for inference. So among your 100 experts, one might have both an analog of a grandmother cell and some weights associated with division. Another expert could be good at both kinds of ERP: Enterprise Resource Planning and the main excuse for frankenmerge creation, lol. Model distillation keeps getting better, but I don't think any modern 2B can compete with GPT-4. Perhaps a 16x34B could, but good luck training that from scratch as a relatively small business, let alone as a nonprofit or a private individual.
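The reply describes why a sparse MoE router has to be trained jointly with the experts: at inference it must pick a suitable expert for each token. For illustration only, here is a minimal PyTorch-style sketch of a top-k routed MoE layer; the class and parameter names (SparseMoE, d_model, n_experts, k, d_ff) and the expert MLP shape are assumptions for the sketch, not anything from the post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    """Minimal sketch of a top-k sparse MoE layer (illustrative, not any specific model)."""

    def __init__(self, d_model=512, n_experts=8, k=2, d_ff=2048):
        super().__init__()
        self.k = k
        # The router is a plain linear layer trained jointly with the experts,
        # which is how it learns to pick a fitting expert for the upcoming token.
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: (batch, seq, d_model)
        logits = self.router(x)                      # (batch, seq, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)   # each token is routed to k experts
        weights = F.softmax(weights, dim=-1)         # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e           # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


# Quick shape check
moe = SparseMoE()
y = moe(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```

A Mixtral-style setup corresponds to 8 experts with top-2 routing; real implementations also add an auxiliary load-balancing loss so the router doesn't collapse onto a few experts, which this sketch omits.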

New activity in NeverSleep/MiquMaid-v1-70B-GGUF 12 months ago

Memory usage

6
#2 opened 12 months ago by Ainonake
New activity in InstantX/InstantID 12 months ago

ControlNet model optimizations

#3 opened 12 months ago by Erilaz
New activity in Undi95/OpenDolphinMaid-4x7b-GGUF 12 months ago

Great Local Model

4
#1 opened about 1 year ago by morgul
New activity in Doctor-Shotgun/Nous-Capybara-limarpv3-34B about 1 year ago

Prompting syntax?

7
#1 opened about 1 year ago by brucethemoose
New activity in TheBloke/dolphin-2_6-phi-2-GGUF about 1 year ago

Long conversations issue

1
#1 opened about 1 year ago by vbuhoijymzoi