LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!
LLaMA-MoE is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE in two steps:
- Partition LLaMA's FFNs into sparse experts and insert a top-K gate into each layer of experts.
- Continually pre-train the initialized MoE model using optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama.
The total number of model parameters is only 6.7B, which keeps the models practical for deployment and research use.
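
The first build step above can be illustrated with a toy, pure-Python sketch. It is not the project's implementation: the random equal-sized split, the softmax restricted to the selected experts, and every function name (`partition_neurons`, `top_k_gate`, `ffn_neurons`, `moe_forward`) are assumptions made for illustration, since this README does not specify the partition method or the gate's normalization. The sketch relies on one fact about SwiGLU-style FFNs: the output is a sum over intermediate neurons, so disjoint groups of neurons can act as separate experts.

```python
import math
import random


def partition_neurons(intermediate_size, num_experts, seed=0):
    """Randomly split FFN neuron indices into equal, disjoint groups (assumed method)."""
    if intermediate_size % num_experts != 0:
        raise ValueError("intermediate_size must be divisible by num_experts")
    indices = list(range(intermediate_size))
    random.Random(seed).shuffle(indices)
    size = intermediate_size // num_experts
    return [sorted(indices[i * size:(i + 1) * size]) for i in range(num_experts)]


def top_k_gate(logits, k):
    """Pick the k largest router logits and softmax-normalize over them (assumed choice)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def silu(x):
    return x / (1.0 + math.exp(-x))


def ffn_neurons(x, w_gate, w_up, w_down, neuron_ids):
    """SwiGLU FFN output restricted to the given intermediate neurons.

    w_gate, w_up and w_down are lists indexed by neuron, each row of length len(x).
    """
    out = [0.0] * len(x)
    for j in neuron_ids:
        g = sum(w * xi for w, xi in zip(w_gate[j], x))
        u = sum(w * xi for w, xi in zip(w_up[j], x))
        activation = silu(g) * u
        for d in range(len(x)):
            out[d] += activation * w_down[j][d]
    return out


def moe_forward(x, w_gate, w_up, w_down, experts, router_logits, k):
    """Weighted sum of the k selected experts' outputs."""
    out = [0.0] * len(x)
    for expert_id, weight in top_k_gate(router_logits, k).items():
        y = ffn_neurons(x, w_gate, w_up, w_down, experts[expert_id])
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out
```

Summing the unweighted outputs of all experts reproduces the dense FFN output, which is why the partitioned model is a reasonable starting point for continual pre-training. In a real model the router logits would come from a learned linear layer applied to `x`; here they are passed in directly to keep the sketch short.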
Model | #Activated Experts | #Experts | #Activated Params | Links |
---|---|---|---|---|
LLaMA-MoE-3.0B | 2 | 16 | 3.0B | [HF Weights] |
LLaMA-MoE-3.5B (4/16) | 4 | 16 | 3.5B | [HF Weights] |
LLaMA-MoE-3.5B (2/8) | 2 | 8 | 3.5B | [HF Weights] |
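
The activated-parameter column can be sanity-checked with a rough count. The sketch below assumes the base model has LLaMA-2 7B dimensions (hidden size 4096, FFN intermediate size 11008, 32 layers, vocabulary 32000, untied input and output embeddings), none of which are stated in this README. It also assumes the gate is a single linear layer per block and ignores normalization weights, so the results are estimates only.

```python
# Assumed LLaMA-2 7B dimensions; not taken from this README.
HIDDEN = 4096
INTERMEDIATE = 11008
LAYERS = 32
VOCAB = 32000


def attention_params():
    # Q, K, V and output projections, each HIDDEN x HIDDEN, in every layer.
    return LAYERS * 4 * HIDDEN * HIDDEN


def ffn_params():
    # Gate, up and down projections of the SwiGLU FFN in every layer.
    return LAYERS * 3 * HIDDEN * INTERMEDIATE


def embedding_params():
    # Input embedding plus untied output head.
    return 2 * VOCAB * HIDDEN


def total_dense_params():
    return attention_params() + ffn_params() + embedding_params()


def activated_params(active_experts, num_experts):
    # Only active_experts / num_experts of each FFN is used per token.
    active_ffn = ffn_params() * active_experts // num_experts
    # Hypothetical router: one HIDDEN x num_experts linear gate per layer.
    router = LAYERS * HIDDEN * num_experts
    return attention_params() + embedding_params() + active_ffn + router


if __name__ == "__main__":
    print(f"total (dense): {total_dense_params() / 1e9:.2f}B")
    for k, n in [(2, 16), (4, 16), (2, 8)]:
        print(f"{k}/{n} experts: {activated_params(k, n) / 1e9:.2f}B activated")
```

Under these assumptions the dense total comes out near 6.7B, and the activated counts land near 3.0B for 2/16 and near 3.5B for both 4/16 and 2/8, in line with the table. The 4/16 and 2/8 configurations match because both activate a quarter of each FFN.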
Model | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
OPT-2.7B | 50.3 | 78.9 | 74.8 | 60.8 | 54.4 | 34.0 | 61.4 | 25.8 | 63.3 | 63.6 | 10.7 | 25.8 |
Pythia-2.8B | 51.5 | 83.2 | 73.6 | 59.6 | 58.8 | 36.7 | 60.7 | 28.1 | 65.9 | 64.6 | 8.7 | 26.8 |
INCITE-BASE-3B | 53.7 | 85.6 | 73.9 | 63.5 | 61.7 | 40.3 | 64.7 | 27.5 | 65.8 | 65.4 | 15.2 | 27.2 |
Open-LLaMA-3B-v2 | 55.6 | 88.0 | 77.9 | 63.1 | 63.3 | 40.1 | 71.4 | 28.1 | 69.2 | 67.4 | 16.0 | 26.8 |
Sheared-LLaMA-2.7B | 56.4 | 87.5 | 76.9 | 65.0 | 63.3 | 41.6 | 71.0 | 28.3 | 73.6 | 68.3 | 17.6 | 27.3 |
LLaMA-MoE-3.0B | 55.5 | 84.2 | 77.5 | 63.6 | 60.2 | 40.9 | 70.8 | 30.6 | 71.9 | 66.6 | 17.0 | 26.8 |
LLaMA-MoE-3.5B (4/16) | 57.7 | 87.6 | 77.9 | 65.5 | 65.6 | 44.2 | 73.3 | 29.7 | 75.0 | 69.5 | 20.3 | 26.8 |
LLaMA-MoE-3.5B (2/8) | 57.6 | 88.4 | 77.6 | 66.7 | 65.3 | 43.1 | 73.3 | 29.6 | 73.9 | 69.4 | 19.8 | 27.0 |