---
tags:
- vllm
- sparsity
- quantized
pipeline_tag: text-generation
license: llama3.1
base_model: neuralmagic/Sparse-Llama-3.1-8B-gsm8k-2of4
datasets:
- openai/gsm8k
language:
- en
metrics:
- accuracy
---

# Sparse-Llama-3.1-8B-gsm8k-2of4-quantized.w4a16

## Model Overview
- **Model Architecture:** Llama-3.1-8B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Sparsity:** 2:4
  - **Weight quantization:** INT4
- **Release Date:** 11/21/2024
- **Version:** 1.0
- **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Model Developers:** Neural Magic

This is an AI model specialized in grade-school math, obtained by fine-tuning the 2:4 sparse [Sparse-Llama-3.1-8B-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-2of4) on the [GSM8k](https://huggingface.co/datasets/openai/gsm8k) dataset, followed by one-shot quantization.
It achieves 64.3% 0-shot accuracy on the test set of GSM8k, compared to 66.3% for the fine-tuned dense model [Llama-3.1-8B-gsm8k](https://huggingface.co/neuralmagic/Llama-3.1-8B-gsm8k), which corresponds to over **96.9% accuracy recovery**.
In contrast, the pretrained [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) achieves 50.7% 5-shot accuracy and the sparse foundational [Sparse-Llama-3.1-8B-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-2of4) model achieves 56.3% 5-shot accuracy.

### Model Optimizations

This model was obtained by quantizing the weights of [Sparse-Llama-3.1-8B-gsm8k-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-gsm8k-2of4) to the INT4 data type.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
That reduction comes on top of the 50% reduction in weights already achieved through the 2:4 pruning used to create [Sparse-Llama-3.1-8B-gsm8k-2of4](https://huggingface.co/neuralmagic/Sparse-Llama-3.1-8B-gsm8k-2of4).

Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating-point representations of the quantized weights.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
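
As a rough illustration, a one-shot W4A16 GPTQ run with llm-compressor might look like the sketch below. The calibration data handling, sample counts, and output path are assumptions for illustration only, not the exact recipe used to produce this checkpoint.

```python
# Illustrative one-shot W4A16 GPTQ sketch with llm-compressor; the calibration
# setup and output path are assumptions, not the exact recipe behind this model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "neuralmagic/Sparse-Llama-3.1-8B-gsm8k-2of4"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration samples drawn from the fine-tuning distribution (an assumption).
ds = load_dataset("openai/gsm8k", "main", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(
    lambda ex: tokenizer(
        ex["question"] + " " + ex["answer"],
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
    ),
    remove_columns=ds.column_names,
)

# Quantize only the Linear layers inside transformer blocks to 4-bit weights,
# keeping 16-bit activations and skipping the lm_head.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir="Sparse-Llama-3.1-8B-gsm8k-2of4-quantized.w4a16",
)
```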

## Deployment with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
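
For example, offline inference with the vLLM Python API could look like the following sketch; the prompt and sampling parameters are placeholders rather than a recommended configuration.

```python
# Offline-inference sketch with the vLLM Python API; the prompt and sampling
# parameters below are illustrative placeholders.
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Sparse-Llama-3.1-8B-gsm8k-2of4-quantized.w4a16"
llm = LLM(model=model_id)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
prompt = (
    "Question: Natalia sold clips to 48 of her friends in April, and then she "
    "sold half as many clips in May. How many clips did Natalia sell "
    "altogether in April and May?\nAnswer:"
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```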

## Evaluation

This model was evaluated on the GSM8k benchmark using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
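
A 0-shot GSM8k evaluation through the harness's Python API might look like the sketch below; the vLLM backend and the arguments shown are assumptions, not necessarily the configuration used for the reported numbers.

```python
# Rough sketch of a 0-shot GSM8k evaluation via the lm-evaluation-harness
# Python API; the backend and arguments are assumptions, not the exact
# configuration behind the reported results.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",  # assumes vLLM is installed; "hf" would use plain transformers
    model_args="pretrained=neuralmagic/Sparse-Llama-3.1-8B-gsm8k-2of4-quantized.w4a16",
    tasks=["gsm8k"],
    num_fewshot=0,
)
print(results["results"]["gsm8k"])
```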

### Accuracy

#### GSM8k Benchmark

<table>
  <tr>
    <td><strong>Metric</strong></td>
    <td style="text-align: center"><strong>Llama-3.1-8B<br>(5-shot)</strong></td>
    <td style="text-align: center"><strong>Sparse-Llama-3.1-8B-2of4<br>(5-shot)</strong></td>
    <td style="text-align: center"><strong>Llama-3.1-8B-gsm8k<br>(0-shot)</strong></td>
    <td style="text-align: center"><strong>Sparse-Llama-3.1-8B-gsm8k-2of4<br>(0-shot)</strong></td>
    <td style="text-align: center"><strong>Sparse-Llama-3.1-8B-gsm8k-2of4-quantized.w4a16<br>(0-shot)</strong></td>
  </tr>
  <tr>
    <td>Accuracy</td>
    <td style="text-align: center">50.7%</td>
    <td style="text-align: center">56.3%</td>
    <td style="text-align: center">66.3%</td>
    <td style="text-align: center">66.9%</td>
    <td style="text-align: center">64.3%</td>
  </tr>
</table>