File size: 16,889 Bytes
54fc463
 
 
 
 
 
 
 
 
 
 
 
582caab
54fc463
 
 
 
 
 
 
 
db52155
 
 
 
 
 
 
 
 
 
 
 
54fc463
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
---
language:
- en
- zh
tags:
- MiniCPM
- ModelBest
- THUNLP
license: apache-2.0
---


# MiniCPM-S-1B-sft-gguf

- Original model: [MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16)
- Model creator and fine-tuned by: [ModelBest](https://modelbest.cn/), [OpenBMB](https://huggingface.co/openbmb), and [THUNLP](https://nlp.csai.tsinghua.edu.cn/)
- Paper: [link](https://arxiv.org/pdf/2402.13516.pdf) (Note: `MiniCPM-S-1B` is denoted as `ProSparse-1B` in the paper.)
- Converted from: [MiniCPM-S-1B-sft-llama-format](https://huggingface.co/openbmb/MiniCPM-S-1B-sft-llama-format)

**This GGUF file is created to be directly runnable with [PowerInfer](https://github.com/SJTU-IPADS/PowerInfer).**

### Chat Template

To make the model sophisticatedly respond to a query, it is recommended to use a standard chat prompt, such as:

```
<用户>{prompt}<AI>
```

where `prompt` is the query text, while `<用户>` and `<AI>` are prompt tokens.

Also, make sure that you have **a bos token `<s>` at the beginning of any input**, or the model can sometimes behave improperly.

### Introduction

The utilization of activation sparsity, namely the existence of considerable weakly-contributed elements among activation outputs, is a promising method for inference acceleration of large language models (LLMs) ([Liu et al., 2023](https://proceedings.mlr.press/v202/liu23am/liu23am.pdf); [Song et al., 2023](https://arxiv.org/pdf/2312.12456.pdf)). Concretely, acceleration methods based on activation sparsity usually achieve higher inference speed by making wiser resource allocation and computation policies to avoid resource waste on these weakly-contributed parameters.

Adopting ReLU as the activation function is a straightforward method to achieve activation sparsity. However, most recent mainstream LLMs adopt activation functions without intrinsic sparsity (e.g., GELU and Swish). Some efforts ([Zhang et al., 2022](https://aclanthology.org/2022.findings-acl.71.pdf); [Mirzadeh et al., 2023](https://arxiv.org/pdf/2310.04564.pdf); [Zhang et al., 2024](https://arxiv.org/pdf/2402.03804.pdf)) introduce ReLU or its variants as the substitutive activation function to help non-ReLU LLMs achieve activation sparsity and inference acceleration, but few can concurrently obtain high sparsity and comparable task-specific performance.

In this work, we introduce a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity while maintaining comparable performance. By applying ProSparse to Swish-activated LLaMA2-7B, LLaMA2-13B, and MiniCPM-1B, we obtain ReLU-activated models with high sparsity of 89.32%, 88.80%, and 87.89%, respectively, while their performance is comparable to the original version. These present the most sparsely activated models among open-source LLaMA versions and competitive end-size models, considerably surpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). Further inference acceleration experiments demonstrate the practical speedup effects of higher sparsity on both [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf) and our two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator).

### Training Dataset

We train the 1B model on about 473.02 billion tokens within 101,000 steps. These consist of 35,000 steps for standard ProSparse pre-training, 60,000 steps for decay, and 6,000 steps for SFT. Except for ProSparse, other training settings are highly consistent with the original [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16). Refer to our [paper](https://arxiv.org/pdf/2402.13516.pdf) and [MiniCPM technical report](https://arxiv.org/pdf/2404.06395) for more details.

Intuitively, training the model with even more tokens or with data of a wider coverage and higher quality will obtain better task-specific performance.

### ProSparse: Training Methodology

The training process of ProSparse consists of three steps (refer to Section 3.2 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details):

1. **Activation Function Substitution**: We substitute the activation function of FFNs with ReLU and apply continual training;
2. **Progressive Sparsity Regularization**: We jointly optimize the model on the conventional next-token prediction loss and \\(L_1\\) regularization loss. The regularization is applied to the sparse intermediate outputs of FFNs with a regularization factor increasing progressively in multiple stages. Specifically, the regularization factor \\(\lambda\\) is set to a small constant for the warmup stage, and then increases along a smooth sine curve for each of the subsequent incremental stages. Each stage is accompanied by certain steps of training. In this way, the model can have more time to adapt to the increasing regularization without radical activation shifts, thus alleviating performance degradation.
3. **Activation Threshold Shifting**: We finally replace ReLU with FATReLU ([Kurtz et al., 2020](https://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf)), a ReLU variant with a positive threshold. This can prune those non-zero weakly-contributed elements in activation outputs and further boost sparsity.

The hyper-parameters for each stage (including the regularization factor \\(\lambda_i\\), the accumulated training steps \\(T_i\\), and the accumulated training tokens) are shown as follows:

| Step Number \\(i\\) | \\(\lambda_i\\) | \\(T_i\\)  | Accumulated Tokens (B) |
| :-------------: | :---------: | :----: | :--------------------: |
|        0        |      0      | 10,000  |         49.15          |
|        1        |   \\(1e-3\\)    | 15,000  |         73.73          |
|        2        |   \\(5e-3\\)    | 20,000 |         98.30          |
|        3        |   \\(5e-3\\)    | 25,000 |         122.88          |
|        4        |   \\(5e-2\\)    | 35,000 |         172.03          |
|      decay      |   \\(5e-2\\) (fixed)    | 95,000 |         466.94          |
|       SFT       |   \\(1e-2\\) (fixed)    | 101,000 |         473.02          |

### Evaluation Results

The evaluation results on the above benchmarks demonstrate the advantage of ProSparse, which is the only method achieving high sparsity and comparable performance to the original Swish-activated LLaMA2. Note that models under all settings are trained with the same number of tokens on the same mixed dataset. Our evaluation is based on the framework [UltraEval](https://github.com/OpenBMB/UltraEval). The evaluation details are listed as follows:

- **Code Generation**: We compute the average pass@1 scores on HumanEval (0-shot) and MBPP (3-shot).

- **Commonsense Reasoning**: We report the average 0-shot accuracies on PIQA, SIQA, HellaSwag, WinoGrande, and COPA.

- **Reading Comprehension**: We compute the average 0-shot accuracies on BoolQ, LAMBADA, and TyDi QA.

- **Other Popular Benchmarks**: We report the average accuracies on GSM8K (8-shot), MMLU (5-shot), Big Bench Hard (BBH) (3-shot), and AGI-Eval (0-shot).

**Notes**: For PIQA, SIQA, HellaSwag, WinoGrande, COPA, BoolQ, LAMBADA, TyDi QA, and AGI-Eval, we obtain the predicted answers based on maximized perplexity. For GSM8K, MMLU, and BBH, the predicted answers are directly generated.

|        Setting        | Average<br>Sparsity | Average<br>Performance | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU  |  BBH  | AGI Eval |
| :-------------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: | :-----------------: |
| LLaMA2-7B    | - | 37.96 | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 |
| ReluLLaMA-7B | 66.98 | 37.62 | 15.85 | 69.64 | 70.54 |  5.84 | 38.64 | 35.07 | 27.73 |
| **ProSparse-7B**\* | 88.11 | 38.31 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 |
| **ProSparse-7B**   | **89.32** | **38.46** | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 |
| LLaMA2-13B | - | 44.06 | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 |
| ReluLLaMA-13B | 71.56 | 42.74 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 |
| **ProSparse-13B**\* | 87.97 | **45.07** | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 |
| **ProSparse-13B**   | **88.80** | 44.90 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 |
| MiniCPM-1B | - | 44.44 | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 |
| **MiniCPM-S-1B**\*  | 86.25 | **44.72** | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 |
| **MiniCPM-S-1B**    | **87.89** | **44.72** | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 |

**Notes**: "Original" refers to the original Swish-activated LLaMA2 versions. ReluLLaMA-7B and ReluLLaMA-13B are available at [7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B) respectively. MiniCPM-1B is available at [1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16). "ProSparse-7B\*", "ProSparse-13B\*", and "MiniCPM-S-1B\*" denote the ProSparse versions without activation threshold shifting.

### Evaluation Issues with LM-Eval

The above results can be replicated with [UltraEval](https://github.com/OpenBMB/UltraEval). Some abnormal results obtained with other popular frameworks such as [LM-Eval](https://github.com/EleutherAI/lm-evaluation-harness) are probably attributed to the absence of the cls token `<s>`, which is not added by default in LM-Eval. A quick temporary fix is shown in the following codes. Other differences in evaluation results may be caused by other reasons, including the few-shot settings, data pre-processing, and extra prompts.

```python
# https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/models/huggingface.py#L945
for _, context_enc, continuation_enc in chunk:
    # sanity check
    assert len(context_enc) > 0
    # Note: a trivial fix here
    if context_enc[0] != 1:
        context_enc = [1] + context_enc
    assert len(continuation_enc) > 0
    assert len(continuation_enc) <= self.max_length
```

Here are the steps to adapting the original [vLLM](https://github.com/vllm-project/vllm) to ProSparse LLaMA models.

1. Replace the file [vllm/model_executor/models/llama.py](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py) in original vLLM with this [file](https://github.com/Raincleared-Song/DejaVu_predictor/blob/main/llama.py).
2. Replace the contents of the original [config.json](https://huggingface.co/SparseLLM/prosparse-llama-2-7b/blob/main/config.json) with this [file](https://github.com/Raincleared-Song/DejaVu_predictor/blob/main/config.json).
3. Set the environment variable `ACT_INFO`. To test the version without activation threshold shifting, `export ACT_INFO=relu`. To test the version with activation threshold shifting, `export ACT_INFO=fatrelu_0.01`.

### Inference Acceleration Effects

First, we utilize [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), a state-of-the-art acceleration framework leveraging activation sparsity. As its inference speed and accuracy heavily rely on the performance of activation predictors, we report the activation recall and predicted sparsity (i.e., two key metrics for evaluating the activation predictor) as well as the number of tokens generated per second by PowerInfer (with one A100 GPU and sufficient CPUs). The GGUF files and activation predictors are also available for ProSparse LLaMA models.

Moreover, considering the potential inference inaccuracies caused by wrong predictions of activation predictors, we implement two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator) for faster accurate inference utilizing activation sparsity. They are responsible for the speedup of two key steps in a gated FFN:

- Step (2) (`S2`): a fused operator of ReLU and \\(\mathbf{s} \odot (\mathbf{x} \mathbf{W}_1^T)\\);
- Step (3) (`S3`): a sparse matrix-vector multiplication operator \\(\mathbf{x}_1 \mathbf{W}_2^T\\).

where \\(\mathbf{s}\\), \\(\mathbf{x}\\), \\(\mathbf{x}_1\\), and \\(\odot\\) denote the gating scores, the FFN input hidden states, the intermediate outputs, and the element-wise multiplication respectively. \\(\mathbf{W}_1\\) and \\(\mathbf{W}_2\\) are FFN weight matrices.

The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, can gain the most benefits among all the settings concerned. Refer to Section 4.3 of [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details.

|        Setting        | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | Speedup<br>to Dense | `S2`<br>Time | Speedup<br>to Dense | `S3`<br/>Time | Speedup<br/>to Dense |
| :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
| Dense-7B | - | - | - | 3.67 | 1.00 | 90.55 | 1.00 | 82.92 | 1.00 |
|     ReluLLaMA-7B      |        66.98        |        90.89         |         58.95         |        11.37        | 3.10 |      67.12       |        1.35         |       63.00       |         1.32         |
| **ProSparse-7B**\*  |        88.11        |      **93.46**       |         75.24         |        **16.30**        | **4.44** |      46.66       |        1.94         |       55.56       |         1.49         |
|   **ProSparse-7B**    |      **89.32**      |        92.34         |       **78.75**       |          -          | - |      **45.38**       |        **2.00**         |       **55.05**       |         **1.51**         |
| Dense-13B | - | - | - | 1.92 | 1.00 | 131.36 | 1.00 | 113.68 | 1.00 |
|     ReluLLaMA-13B     |        71.56        |        86.41         |         71.93         |        6.59         | 3.43 |      69.92       |        1.88         |       75.47       |         1.51         |
| **ProSparse-13B**\* |        87.97        |        91.02         |         77.93         |        **8.67**         | **4.52** |      55.29       |        2.38         |       67.50       |         1.68         |
|   **ProSparse-13B**   |        **88.80**        |        **91.11**         |         **78.28**         |          -          | - |      **53.78**       |        **2.44**         |       **66.73**       |         **1.70**         |

**Notes**: For "Dense" settings, the "Inference Speed" (token/sec) is obtained by [llama.cpp](https://github.com/ggerganov/llama.cpp), and the time (us) for steps (2) and (3) is measured without sparse GPU operators. For other sparse settings, the "Inference Speed" is obtained by [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), and sparse GPU operators are applied. ProSparse settings with activation threshold shifting and the MiniCPM architecture are not supported by PowerInfer at present.

### Citation

Please kindly cite using the following BibTeX:

```bibtex
@article{song2024prosparse,
  title={{ProSparse}: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models},
  author={Song, Chenyang and Han, Xu and Zhang, Zhengyan and Hu, Shengding and Shi, Xiyu and Li, Kuai and Chen, Chen and Liu, Zhiyuan and Li, Guangli and Yang, Tao and Sun, Maosong},
  year={2024},
  journal={arXiv preprint arXiv:2402.13516},
  url={https://arxiv.org/pdf/2402.13516.pdf}
}
```

### License

This repository is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.

The usage of MiniCPM model weights must strictly follow [the General Model License (GML)](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E5%95%86%E4%B8%9A%E6%8E%88%E6%9D%83.md).

The models and weights of MiniCPM are completely free for academic research.

If you intend to utilize the model for commercial purposes, please reach out to cpm@modelbest.cn to obtain the certificate of authorization.

### Statement

As a language model, MiniCPM generates content by learning from a vast amount of text.

However, it does not possess the ability to comprehend or express personal opinions or value judgments.

Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.

Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.

#### Acknowledgments

The model card is modified from [ReluLLaMA-7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16).

A duplicate of this repo: [link](https://huggingface.co/SparseLLM/ProSparse-MiniCPM-1B-sft).