---
title: README
emoji: 🌍
colorFrom: red
colorTo: gray
sdk: static
pinned: false
---

# LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!

LLaMA-MoE is a series of open-sourced Mixture-of-Experts (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE in two steps:

  1. Partition LLaMA's FFNs into sparse experts and insert a top-K gate into each layer of experts (a minimal sketch of this construction follows below).
  2. Continually pre-train the initialized MoE model with the optimized data-sampling weights from Sheared LLaMA and the filtered datasets from SlimPajama.

The total number of model parameters is only 6.7B, which makes the models friendly for deployment and research use.
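
The following is a minimal PyTorch sketch of the construction idea, not the project's actual code: the dense FFN's intermediate neurons are split into disjoint expert slices, and a newly initialized top-K softmax gate routes each token to a subset of those slices. Module names and default sizes (a LLaMA-7B-style SwiGLU FFN) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFromDenseFFN(nn.Module):
    """Toy illustration: a LLaMA-style SwiGLU FFN split into disjoint experts."""

    def __init__(self, hidden_size=4096, intermediate_size=11008,
                 num_experts=16, top_k=2):
        super().__init__()
        assert intermediate_size % num_experts == 0
        self.num_experts, self.top_k = num_experts, top_k
        slice_size = intermediate_size // num_experts
        # Each expert owns one slice of the original intermediate dimension.
        # In the real construction these weights would be copied from slices of
        # the dense LLaMA FFN weights; here they are randomly initialized.
        self.gate_proj = nn.ModuleList(
            nn.Linear(hidden_size, slice_size, bias=False) for _ in range(num_experts))
        self.up_proj = nn.ModuleList(
            nn.Linear(hidden_size, slice_size, bias=False) for _ in range(num_experts))
        self.down_proj = nn.ModuleList(
            nn.Linear(slice_size, hidden_size, bias=False) for _ in range(num_experts))
        # The newly inserted top-K gate: one routing logit per expert.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x):                                  # x: (num_tokens, hidden_size)
        probs = F.softmax(self.router(x), dim=-1)          # (num_tokens, num_experts)
        weights, chosen = torch.topk(probs, self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e in range(self.num_experts):
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            h = x[token_idx]
            # SwiGLU expert over its slice, weighted by the gate probability.
            expert_out = self.down_proj[e](F.silu(self.gate_proj[e](h)) * self.up_proj[e](h))
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert_out
        return out


if __name__ == "__main__":
    # Quick shape check on random tokens.
    layer = MoEFromDenseFFN(hidden_size=64, intermediate_size=256, num_experts=8, top_k=2)
    print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```
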

| Model | #Activated Experts | #Experts | #Activated Params | Links |
| :--- | :---: | :---: | :---: | :--- |
| LLaMA-MoE-3.0B | 2 | 16 | 3.0B | [🤗 HF Weights] |
| LLaMA-MoE-3.5B (4/16) | 4 | 16 | 3.5B | [🤗 HF Weights] |
| LLaMA-MoE-3.5B (2/8) | 2 | 8 | 3.5B | [🤗 HF Weights] |
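
A quick-start sketch for loading one of the released checkpoints with Hugging Face `transformers`. The repository id below is an assumption; substitute the id behind the corresponding HF Weights link above. Custom MoE architectures typically require `trust_remote_code=True`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; replace with the id from the weights table above.
model_id = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()

inputs = tokenizer("Suzhou is famous for", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
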
Evaluation results on downstream tasks:

| Model | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| OPT-2.7B | 50.3 | 78.9 | 74.8 | 60.8 | 54.4 | 34.0 | 61.4 | 25.8 | 63.3 | 63.6 | 10.7 | 25.8 |
| Pythia-2.8B | 51.5 | 83.2 | 73.6 | 59.6 | 58.8 | 36.7 | 60.7 | 28.1 | 65.9 | 64.6 | 8.7 | 26.8 |
| INCITE-BASE-3B | 53.7 | 85.6 | 73.9 | 63.5 | 61.7 | 40.3 | 64.7 | 27.5 | 65.8 | 65.4 | 15.2 | 27.2 |
| Open-LLaMA-3B-v2 | 55.6 | 88.0 | 77.9 | 63.1 | 63.3 | 40.1 | 71.4 | 28.1 | 69.2 | 67.4 | 16.0 | 26.8 |
| Sheared-LLaMA-2.7B | 56.4 | 87.5 | 76.9 | 65.0 | 63.3 | 41.6 | 71.0 | 28.3 | 73.6 | 68.3 | 17.6 | 27.3 |
| LLaMA-MoE-3.0B | 55.5 | 84.2 | 77.5 | 63.6 | 60.2 | 40.9 | 70.8 | 30.6 | 71.9 | 66.6 | 17.0 | 26.8 |
| LLaMA-MoE-3.5B (4/16) | 57.7 | 87.6 | 77.9 | 65.5 | 65.6 | 44.2 | 73.3 | 29.7 | 75.0 | 69.5 | 20.3 | 26.8 |
| LLaMA-MoE-3.5B (2/8) | 57.6 | 88.4 | 77.6 | 66.7 | 65.3 | 43.1 | 73.3 | 29.6 | 73.9 | 69.4 | 19.8 | 27.0 |

Numbers in parentheses after a task name denote the number of few-shot examples used for that task.
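
Scores of this kind are commonly produced with EleutherAI's lm-evaluation-harness. The sketch below is not the authors' evaluation script; the v0.4-style API call, the task names, and the repository id are assumptions, shown only to illustrate how a few of the tasks could be run.

```python
import lm_eval

# Hedged sketch: evaluate an assumed checkpoint on a handful of tasks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=llama-moe/LLaMA-MoE-v1-3_5B-2_8,trust_remote_code=True,dtype=bfloat16",
    tasks=["sciq", "piqa", "winogrande", "arc_easy"],
    num_fewshot=0,
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```
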