metadata

title: README
emoji: 🌍
colorFrom: red
colorTo: gray
sdk: static
pinned: false

LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!

LLaMA-MoE is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE in two steps:

1. Partition LLaMA's FFNs into sparse experts and insert a top-K gate into each layer of experts.
2. Continually pre-train the initialized MoE model with optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama.
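
The first step can be illustrated with a minimal NumPy sketch. It is not the project's actual implementation: the random, equal-sized split of intermediate neurons, the softmax over the selected gate logits, and all names (`partition_ffn`, `moe_ffn`, `w_router`) are assumptions made for illustration. The sketch splits a LLaMA-style SwiGLU FFN into disjoint experts, checks that the experts together still reproduce the dense FFN, and then routes each token to its top-K experts.

```python
import numpy as np


def silu(x):
    # SiLU activation used in LLaMA-style SwiGLU FFNs.
    return x / (1.0 + np.exp(-x))


def dense_ffn(x, w_gate, w_up, w_down):
    # SwiGLU FFN: down(silu(x @ w_gate) * (x @ w_up)).
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


def partition_ffn(w_gate, w_up, w_down, num_experts, rng):
    # Assumption: intermediate neurons are split at random into equal,
    # disjoint groups. d_ff must be divisible by num_experts.
    d_ff = w_gate.shape[1]
    groups = np.split(rng.permutation(d_ff), num_experts)
    return [(w_gate[:, idx], w_up[:, idx], w_down[idx, :]) for idx in groups]


def moe_ffn(x, experts, w_router, top_k):
    # Score every expert per token, keep the top_k, and mix their outputs.
    logits = x @ w_router                       # (tokens, num_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Assumption: gate weights are a softmax over the selected logits only.
    weights = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for slot in range(top_k):
            expert = experts[top_idx[t, slot]]
            out[t] += weights[t, slot] * dense_ffn(x[t:t + 1], *expert)[0]
    return out


rng = np.random.default_rng(0)
d_model, d_ff, num_experts, top_k = 8, 32, 4, 2  # toy sizes, not the real ones
w_gate = rng.normal(size=(d_model, d_ff))
w_up = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))
w_router = rng.normal(size=(d_model, num_experts))  # hypothetical, untrained gate
x = rng.normal(size=(5, d_model))

experts = partition_ffn(w_gate, w_up, w_down, num_experts, rng)

# Summing all experts without gating recovers the dense FFN output,
# because the experts hold disjoint slices of the same weights.
recombined = sum(dense_ffn(x, *e) for e in experts)
assert np.allclose(recombined, dense_ffn(x, w_gate, w_up, w_down))

y = moe_ffn(x, experts, w_router, top_k)
print(y.shape)
```

Because only `top_k` of the experts run for each token, the sparse layer touches a fraction of the original FFN weights. The gate starts untrained, which is why the second step, continual pre-training, is needed.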

Each model has only 6.7B parameters in total, which makes it friendly for deployment and research use.

Model	#Activated Experts	#Experts	#Activated Params	Links
LLaMA-MoE-3.0B	2	16	3.0B	[🤗 HF Weights]
LLaMA-MoE-3.5B (4/16)	4	16	3.5B	[🤗 HF Weights]
LLaMA-MoE-3.5B (2/8)	2	8	3.5B	[🤗 HF Weights]
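
The relationship between the 6.7B total and the activated parameter counts in the table can be checked with a rough calculation. This is a sketch under assumptions not stated in this README: it uses the publicly documented 7B LLaMA configuration (hidden size 4096, intermediate size 11008, 32 layers, vocabulary of 32000), assumes only the FFN weights are partitioned into experts, and ignores the small gate matrices.

```python
# Assumed 7B LLaMA configuration (not stated in this README).
HIDDEN = 4096
INTERMEDIATE = 11008
LAYERS = 32
VOCAB = 32000


def ffn_params_per_layer():
    # gate_proj, up_proj and down_proj of the SwiGLU FFN.
    return 3 * HIDDEN * INTERMEDIATE


def total_params():
    embeddings = 2 * VOCAB * HIDDEN   # input embedding + untied output head
    attention = 4 * HIDDEN * HIDDEN   # q, k, v and o projections
    norms = 2 * HIDDEN                # two RMSNorm weights per layer
    per_layer = attention + ffn_params_per_layer() + norms
    return embeddings + LAYERS * per_layer + HIDDEN  # + final norm


def activated_params(active_experts, num_experts):
    # Only the selected fraction of the FFN weights is used per token;
    # gate parameters are ignored in this estimate.
    ffn_total = LAYERS * ffn_params_per_layer()
    inactive = ffn_total - ffn_total * active_experts // num_experts
    return total_params() - inactive


for k, n in [(2, 16), (4, 16), (2, 8)]:
    print(f"{k}/{n} experts: {activated_params(k, n) / 1e9:.2f}B activated "
          f"of {total_params() / 1e9:.2f}B total")
```

Under these assumptions the estimate gives about 2.95B activated parameters for 2 of 16 experts and about 3.49B for 4 of 16 or 2 of 8, which is consistent with the 3.0B and 3.5B labels above.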

Model	Average	SciQ	PIQA	WinoGrande	ARC-e	ARC-c (25)	HellaSwag (10)	LogiQA	BoolQ (32)	LAMBADA	NQ (32)	MMLU (5)
OPT-2.7B	50.3	78.9	74.8	60.8	54.4	34.0	61.4	25.8	63.3	63.6	10.7	25.8
Pythia-2.8B	51.5	83.2	73.6	59.6	58.8	36.7	60.7	28.1	65.9	64.6	8.7	26.8
INCITE-BASE-3B	53.7	85.6	73.9	63.5	61.7	40.3	64.7	27.5	65.8	65.4	15.2	27.2
Open-LLaMA-3B-v2	55.6	88.0	77.9	63.1	63.3	40.1	71.4	28.1	69.2	67.4	16.0	26.8
Sheared-LLaMA-2.7B	56.4	87.5	76.9	65.0	63.3	41.6	71.0	28.3	73.6	68.3	17.6	27.3
LLaMA-MoE-3.0B	55.5	84.2	77.5	63.6	60.2	40.9	70.8	30.6	71.9	66.6	17.0	26.8
LLaMA-MoE-3.5B (4/16)	57.7	87.6	77.9	65.5	65.6	44.2	73.3	29.7	75.0	69.5	20.3	26.8
LLaMA-MoE-3.5B (2/8)	57.6	88.4	77.6	66.7	65.3	43.1	73.3	29.6	73.9	69.4	19.8	27.0