LLaMA-MoE-v2-3.8B (1+1/7) SFT

[πŸ’» Code] | [πŸ“ƒ Technical Report]

LLaMA-MoE-v2 is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA3. We build LLaMA-MoE-v2 in two steps:

  1. Partition LLaMA3's FFN layers or attention layers into sparse experts and insert a top-K gate in front of each layer of experts (a minimal sketch of this gating appears below the model table).
  2. Supervised fine-tune the constructed MoE models on open-source data with two-stage training.
| Model | #Activated Experts | #Experts | #Activated Params | SFT Model |
| :--- | :---: | :---: | :---: | :---: |
| LLaMA-MLP-MoE (2/8) | 2 | 8 | 3.8B | πŸ€— SFT |
| LLaMA-MLP-MoE (1+1/7) | 2 | 8 | 3.8B | πŸ€— SFT |
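
For intuition, here is a minimal sketch of the kind of layer produced by step 1: a LLaMA-style SwiGLU MLP partitioned into several smaller experts behind a learned top-K softmax router. All names and sizes are illustrative, not the released implementation (4096/1792 simply mirrors an 8-way split of a LLaMA3-8B MLP).

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEMLP(nn.Module):
    # Illustrative sketch only: `num_experts` small SwiGLU experts plus a top-k router.
    def __init__(self, hidden_size=4096, expert_inter_size=1792, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.ModuleDict({
                "gate_proj": nn.Linear(hidden_size, expert_inter_size, bias=False),
                "up_proj": nn.Linear(hidden_size, expert_inter_size, bias=False),
                "down_proj": nn.Linear(expert_inter_size, hidden_size, bias=False),
            })
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (batch, seq_len, hidden_size)
        scores = F.softmax(self.router(x), dim=-1)             # per-token routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)         # keep only the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the kept weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e                     # tokens routed to expert e in this slot
                if mask.any():
                    h = x[mask]
                    h = expert["down_proj"](F.silu(expert["gate_proj"](h)) * expert["up_proj"](h))
                    out[mask] = out[mask] + weights[..., slot][mask].unsqueeze(-1) * h
        return out

In the residual (1+1/7) naming, one expert is shared and always active while one of the remaining seven is selected per token, so two experts are activated per token, matching the table above.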

πŸš€ QuickStart

# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v2-3_8B-residual-sft"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.cuda()

input_text = "Could you recommend me some mystery novels?"
# Wrap the prompt in the Llama-3 chat format expected by the SFT model.
input_text = f"<|start_header_id|>user<|end_header_id|>\n\n{input_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
inputs = tokenizer(input_text, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()

# Sample a response (max_length counts the prompt tokens as well).
pred = model.generate(input_ids, max_length=200, temperature=1.0, do_sample=True, use_cache=True)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
"""
I'd be delighted to recommend some mystery novels to you! Here are a few suggestions across various sub-genres:

**Classic Whodunit**

1. "And Then There Were None" by Agatha Christie - A timeless tale of ten strangers who are invited to an isolated island, only to be killed off one by one.
2. "The Murder on the Orient Express" by Agatha Christie - A classic whodunit set on a luxurious train traveling from Istanbul to Paris, where a famous author goes missing.
3. "The Devil in the White City" by Erik Larson - A non-fiction book that combines historical events with a mystery, exploring the 1893 World's Columbian Exposition in Chicago and the serial killer H.H. Holmes.

**Modern Whodunits**

1. "Gone Girl" by Gillian Flynn - A twisty, psychological thriller about a couple whose seemingly perfect ...
"""

πŸ“Š Performance

| Model | #Training Tokens | MMLU(5) | GSM8k(8) | HumanEval(pass@10) | IFEval | BoolQ(32) | SciQ | PIQA | ARC-c(25) | TruthfulQA | HellaSwag(10) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LLaMA3-8B | 15T | 67.2 | 76.5 | 71.4 | 76.5 | 83.0 | 93.2 | 78.5 | 61.9 | 51.7 | 78.8 |
| INCITE-3B | 1T | 25.1 | 2.1 | 6.92 | 30.1 | 66.5 | 94.7 | 74.4 | 40.2 | 36.4 | 65.6 |
| Sheared-LLaMA-2.7B | 50B | 28.2 | 1.9 | 3.2 | 28.8 | 67.6 | 75.8 | 41.1 | 47.6 | 71.2 | 39.0 |
| Gemma-2-2b | 2T | 53.0 | 26.3 | 46.1 | 34.9 | 72.3 | 75.8 | 67.5 | 52.6 | 50.8 | 69.0 |
| Salamandra-2b | 7.8T | 25.1 | 1.90 | 5.82 | 27.7 | 68.0 | 89.8 | 74.7 | 46.3 | 43.4 | 62.3 |
| SmolLM2-1.7B | 11T | 50.4 | 38.5 | 39.1 | 29.0 | 68.2 | 84.3 | 76.0 | 53.2 | 39.9 | 72.6 |
| OpenMoE-3B-9B | 1T | 26.5 | 1.36 | 1.01 | 31.2 | 61.7 | 68.4 | 65.7 | 33.3 | 40.5 | 56.5 |
| LLaMA-MoE-3B-7B | 200B | 28.2 | 4.62 | 12.0 | 28.1 | 68.1 | 88.8 | 77.9 | 44.0 | 33.3 | 73.2 |
| OLMoE-1B-7B | 1T | 53.8 | 40.9 | 40.5 | 35.5 | 80.9 | 94.9 | 80.1 | 55.6 | 43.3 | 79.6 |
| MLP-MoE (8top2) | 7B | 40.6 | 53.1 | 53.5 | 32.7 | 74.6 | 90.6 | 69.3 | 42.8 | 45.6 | 59.0 |
| MLP-MoE (8top2) | 8.4B | 41.0 | 59.6 | 57.1 | 31.7 | 74.5 | 90.2 | 69.5 | 43.3 | 46.9 | 58.1 |
| MLP-MoE (1+7top1) | 7B | 42.7 | 55.0 | 51.2 | 36.0 | 76.9 | 88.8 | 67.9 | 40.2 | 46.9 | 53.7 |
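
Numbers in parentheses are few-shot counts (e.g. MMLU with 5 shots, GSM8k with 8 shots). The sketch below is a rough reproduction recipe with lm-evaluation-harness; it assumes the v0.4+ Python API and that this checkpoint loads through the harness's Hugging Face backend, and the task names and shot counts are illustrative rather than the exact setup used for the table.

import lm_eval

# Illustrative only: score a couple of the reported tasks via the HF backend.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=llama-moe/LLaMA-MoE-v2-3_8B-residual-sft,dtype=bfloat16,trust_remote_code=True",
    tasks=["mmlu", "gsm8k"],
    num_fewshot=5,  # the table uses different shot counts per task; adjust per run
    batch_size=8,
)
print(results["results"])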

πŸ“ƒ Citation

@misc{llama-moe-v2,
  title={LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training},
  author={Xiaoye Qu and Daize Dong and Xuyang Hu and Tong Zhu and Weigao Sun and Yu Cheng},
  year={2024},
  month={Nov},
  url={https://arxiv.org/abs/2411.15708}
}