|
--- |
|
datasets: |
|
- wikipedia |
|
- allenai/c4 |
|
language: |
|
- en |
|
tags: |
|
- MoE |
|
--- |
|
## LLaMA-8x265M-MoE |
|
|
|
[💻 Code](https://github.com/JuncaiL/SpecMoE/) |
|
|
|
👋 Very nice to meet you here~ |
|
|
|
❤️ This repo contains the model `LLaMA-8x265M-MoE` (970M parameters in total), which activates 2 out of 8 experts (332M activated parameters). The model is trained from scratch in FP32 precision: first for 1 epoch on the Wikipedia dataset, and then for 1 epoch on 10% of the C4 dataset (10 of its 1024 data shards). It is NOT fine-tuned on instruction pairs, so it may not perform well as a chatbot. |
|
|
|
📢 This series also includes a dense version (without the MoE structure); see [🤗 this repo](https://huggingface.co/JuncaiL/llama-265m). |
|
|
|
|
|
|
|
### 1. 🚀QuickStart |
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "JuncaiL/llama-8x265m-moe"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Beijing is a famous city"
inputs = tokenizer(input_text, return_tensors="pt", return_token_type_ids=False)
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, do_sample=False)  # greedy decoding
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Beijing is a famous city in China. It is the capital of the Beijing Province and the largest city in China. It is also the home of the world’s largest city, Beijing.
# The city is the
```
|
|
|
|
|
|
|
### 2. 📑Checkpoint Details and Evaluation |
|
|
|
**Model Parameter** |
|
|
|
| Model | #Experts | #Activated Experts | #Params | #Activated Params | FLOPs (T) per sample (seq=2048) | Model Weights |
| ------------------- | -------- | ------------------ | ------- | ----------------- | ------------------------------- | ------------------------------------------------------------ |
| 265M | - | - | 265M | 265M | 0.48 | [🤗 llama-265m](https://huggingface.co/JuncaiL/llama-265m) |
| 8 $\times$ 265M MoE | 8 | 2 | 970M | 332M | 0.76 | [🤗 llama-8x265m-moe](https://huggingface.co/JuncaiL/llama-8x265m-moe) |
| llama-7b | - | - | 7B | 7B | 25.29 | - |
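
As a quick sanity check, the total parameter count in the table can be reproduced from the loaded checkpoint (reusing the `model` object from the QuickStart above; the exact number may differ slightly depending on whether embeddings are tied):

```python
# Reuses the `model` loaded in the QuickStart section above.
total_params = sum(p.numel() for p in model.parameters())
print(f"total parameters: {total_params / 1e6:.0f}M")  # roughly 970M for the MoE model
```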
|
|
|
**Model Evaluation** |
|
|
|
We use the "average number of tokens verified" $N$ (see this [reference](https://arxiv.org/abs/2305.09781)) as the metric to evaluate these models. Given the same input to the small speculative model and to llama-7b, this metric counts how many successive tokens, starting from the first predicted token, in the small model's output match the corresponding output of llama-7b. |
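
The evaluation code lives in the [SpecMoE repo](https://github.com/JuncaiL/SpecMoE/); a minimal sketch of the per-prompt count, assuming the greedy outputs of both models are available as lists of token ids, could look like this:

```python
def tokens_verified(draft_ids: list[int], target_ids: list[int]) -> int:
    """Number of successive tokens, counted from the first predicted token,
    on which the small (draft) model agrees with llama-7b for the same prompt."""
    n = 0
    for d, t in zip(draft_ids, target_ids):
        if d != t:
            break
        n += 1
    return n

# N is the average of tokens_verified(...) over all prompts in a dataset.
```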
|
|
|
- **Average number of tokens verified** |
|
|
|
| Dataset | 8 $\times$ 265M MoE | GPT without MoE |
| ------------------------------------- | ------------------- | --------------- |
| tatsu-lab/alpaca | 3.2362 | 3.0334 |
| alespalla/chatbot_instruction_prompts | 3.2031 | 3.0823 |
| web_questions | 2.7201 | 2.5541 |
| MohamedRashad/ChatGPT-prompts | 3.0954 | 2.9768 |
|
|
|
Suppose the small speculative model has a next-token hit rate of $p$ given the same input. Then we have |
|
|
|
$$ 1p + 2p^2 + 3p^3 + \cdots = N $$ |
|
|
|
Summing the series gives $\frac{p}{(1-p)^2} = N$; solving the resulting quadratic $Np^2 - (2N+1)p + N = 0$ for the root in $(0,1)$ yields the hit rate: |
|
|
|
$$ p = 1 + \frac{1-\sqrt{1+4N}}{2N}$$ |
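
Plugging the measured $N$ into this formula reproduces the hit rates reported below, for example:

```python
from math import sqrt

def hit_rate(N: float) -> float:
    # p = 1 + (1 - sqrt(1 + 4N)) / (2N)
    return 1 + (1 - sqrt(1 + 4 * N)) / (2 * N)

print(round(hit_rate(3.2362), 3))  # 0.578  (tatsu-lab/alpaca, 8x265M MoE)
print(round(hit_rate(3.0334), 3))  # 0.567  (tatsu-lab/alpaca, GPT without MoE)
```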
|
|
|
- **Hit Rate** |
|
|
|
| Dataset | 8 $\times$ 265M MoE | GPT without MoE |
| ------------------------------------- | ------------------- | --------------- |
| tatsu-lab/alpaca | 0.578 | 0.567 |
| alespalla/chatbot_instruction_prompts | 0.576 | 0.570 |
| web_questions | 0.550 | 0.540 |
| MohamedRashad/ChatGPT-prompts | 0.571 | 0.565 |
|
|
|
|
|
|
|
### 3. 🚧Limitation and Future Plans |
|
|
|
For the MoE model, we only report how accurately this small speculative model approximates the output of llama-7b. In practice, to achieve low latency on real hardware, the implementation of our MoE layer still needs to be improved: in this version the experts are computed one by one (sequentially), and their computation needs to be fused. A rough illustration of this sequential pattern is sketched below. |
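
An illustrative sketch (not the actual model code) of the sequential per-expert pattern, assuming a top-2 router and made-up layer sizes:

```python
import torch
import torch.nn as nn

class NaiveTop2MoE(nn.Module):
    """Top-2 MoE FFN that loops over experts one by one, as described above.
    Sizes and routing details are illustrative, not the real checkpoint's."""

    def __init__(self, hidden: int = 1024, ffn: int = 2816, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, hidden)
        probs = self.router(x).softmax(dim=-1)   # (num_tokens, n_experts)
        top_w, top_idx = probs.topk(2, dim=-1)   # activate 2 of 8 experts per token
        out = torch.zeros_like(x)
        # Sequential per-expert computation: one set of GEMMs per expert.
        # Fusing these into a single grouped/batched GEMM is the planned improvement.
        for e, expert in enumerate(self.experts):
            hit = (top_idx == e)                  # (num_tokens, 2)
            token_ids = hit.any(dim=-1).nonzero(as_tuple=True)[0]
            if token_ids.numel() == 0:
                continue
            w = (top_w * hit)[token_ids].sum(dim=-1, keepdim=True)
            out[token_ids] += w * expert(x[token_ids])
        return out
```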
|
|
|
|
|
|
|
### Acknowledgment |
|
|
|
1. My implementation of the MoE structure is based on the repo [llama-moe/LLaMA-MoE-v1-3_5B-2_8](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-2_8). |
|
2. My inspiration for Speculative Inference comes from the paper "SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification" ([link](https://arxiv.org/abs/2305.09781)) . I am very appreciative of the help and suggestions from the SpecInfer group. ❤️ |
|
|
|
|
|
|
|
### Citation |
|
|
|
``` |
|
@misc{specmoe-2024, |
|
title={SpecMoE: Building A Speculative MoE Model To Accelerate Inference}, |
|
author={Juncai Liu}, |
|
year={2024}, |
|
month={March}, |
|
url={https://github.com/JuncaiL/SpecMoE/} |
|
} |
|
``` |
|
|
|
|
|
|
|
### Contact |
|
|
|
If you have any interest in or questions about this project, please feel free to contact me. |
|
|
|
`liujc19@mails.tsinghua.edu.cn` (before June 30, 2024) or `liujc19@tsinghua.org.cn` (after June 30, 2024)