Why is there only one shared expert in the code?
Great job!
I've read your blog post and the modeling code in Hugging Face Transformers. In the blog, I came across this description: In the case of the Qwen1.5-MoE-A2.7B model, we have incorporated 4 shared experts to be always activated alongside 60 routing experts with 4 to be activated. These four consistently activated shared experts provide a more adaptive approach. However, in the code I only see one shared expert. Why is that?
This is the code from modeling_qwen2_moe.py.
self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False)
self.experts = nn.ModuleList(
    [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)]
)
# Only one shared expert??
self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size)
self.shared_expert_gate = torch.nn.Linear(config.hidden_size, 1, bias=False)
Again, thanks for your great work!
My guess: there is only 1 shared expert, but its intermediate_size (5632) is 4x larger than that of each routed expert (1408), so it is equivalent in capacity to the so-called 4 shared experts.
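For anyone who wants to double-check, here is a quick sketch that just reads the checkpoint config (I'm assuming the Hub repo id Qwen/Qwen1.5-MoE-A2.7B; the field names are the ones defined on Qwen2MoeConfig, and the expected values are the 1408/5632 numbers above):

from transformers import AutoConfig

# Download only the config (a small JSON file) of the released checkpoint.
config = AutoConfig.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B")

print(config.moe_intermediate_size)            # 1408 -> width of each routed expert
print(config.shared_expert_intermediate_size)  # 5632 -> width of the single shared expert

# The single shared expert is 4x as wide as one routed expert,
# which matches the "4 shared experts" wording in the blog.
assert config.shared_expert_intermediate_size == 4 * config.moe_intermediate_size

And as far as I can tell from the forward pass, the shared expert's output is scaled by a sigmoid over shared_expert_gate and added to the routed experts' output, so it behaves like one always-active expert with 4x the capacity of a routed one.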
Hmm... that makes sense. 🐶