---
license: mit
tags:
- deepseek
- fp8
- vllm
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
library_name: transformers
---

# DeepSeek-R1-Distill-Qwen-32B-FP8-Dynamic

## Model Overview
- **Model Architecture:** DeepSeek-R1-Distill-Qwen-32B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 2/6/2025
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B).

### Model Optimizations

This model was obtained by quantizing the weights and activations of DeepSeek-R1-Distill-Qwen-32B to the FP8 data type, making it ready for inference with vLLM. This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformer blocks are quantized.
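
As a rough illustration of the ~50% figure, the sketch below estimates weight storage at 16-bit versus FP8 precision. This is a back-of-envelope approximation: unquantized layers such as the embeddings make real checkpoints slightly larger.

```python
# Back-of-envelope estimate of weight storage at 16-bit vs. FP8 precision.
# Illustrative only; assumes all ~32B parameters are quantized.
num_params = 32e9                    # ~32 billion parameters
gib = 1024**3
bf16_weights = num_params * 2 / gib  # 16-bit: 2 bytes per parameter
fp8_weights = num_params * 1 / gib   # FP8: 1 byte per parameter
print(f"16-bit weights: ~{bf16_weights:.0f} GiB")  # ~60 GiB
print(f"FP8 weights: ~{fp8_weights:.0f} GiB")      # ~30 GiB
```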

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 4096, 1
model_name = "neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-Dynamic"

# The tokenizer supplies the chat template and the eos token id used below.
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

# Render each conversation with the model's chat template before generation.
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```
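
With `tensor_parallel_size=1`, the model runs on a single GPU; on a multi-GPU node, `tp_size` can be increased to shard the model across devices. Raising `max_model_len` allows longer prompts at the cost of additional KV-cache memory.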

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
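
For example, after launching a local server with `vllm serve neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-Dynamic`, the model can be queried with the standard OpenAI Python client. The sketch below assumes vLLM's default endpoint (`http://localhost:8000/v1`); the API key is a placeholder.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default endpoint assumed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-Dynamic",
    messages=[{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
    max_tokens=256,
    temperature=0.3,
)
print(response.choices[0].message.content)
```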

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
import argparse
import os

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot


def main():
    parser = argparse.ArgumentParser(description="Quantize a transformer model to FP8")
    parser.add_argument("--model_id", type=str, required=True,
                        help='The model ID from Hugging Face (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")')
    parser.add_argument("--save_path", type=str, default=".",
                        help="Directory in which to save the quantized model; the final path is <save_path>/<model_name>-FP8-dynamic")
    args = parser.parse_args()

    # Load the model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(
        args.model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)

    # Configure the quantization algorithm and scheme:
    # FP8_DYNAMIC uses static per-channel FP8 weights and dynamic per-token
    # FP8 activations, so no calibration data is needed; lm_head is kept
    # in its original precision.
    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )

    # Apply quantization in a single one-shot pass
    oneshot(model=model, recipe=recipe)

    save_path = os.path.join(args.save_path, args.model_id.split("/")[-1] + "-FP8-dynamic")
    os.makedirs(save_path, exist_ok=True)

    # Save to disk in compressed-tensors format
    model.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)
    print(f"Model and tokenizer saved to: {save_path}")


if __name__ == "__main__":
    main()
```
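
Assuming the snippet is saved as `quantize_fp8.py` (an illustrative filename), it can be invoked as, e.g., `python quantize_fp8.py --model_id deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --save_path ./quantized`.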

## Evaluation

The model was evaluated on the OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) benchmarks, using the following commands:

OpenLLM Leaderboard V1:
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-Dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```
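
The `openllm` task group bundles the six V1 benchmarks reported below: ARC-Challenge, GSM8K, HellaSwag, MMLU, TruthfulQA, and Winogrande.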

OpenLLM Leaderboard V2:
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-Dynamic",dtype=auto,add_bos_token=False,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```
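
Here `--apply_chat_template` formats each request with the model's chat template, and `--fewshot_as_multiturn` presents few-shot examples as alternating conversation turns, matching the V2 leaderboard's evaluation setup.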

### Accuracy

In the tables below, recovery is the quantized model's score expressed as a percentage of the unquantized baseline's score; values near 100% indicate that quantization preserved accuracy.

#### OpenLLM Leaderboard V1 evaluation scores

| Metric | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-Dynamic |
|-----------------------------------|:----------------------------------------:|:----------------------------------------------------:|
| ARC-Challenge (Acc-Norm, 25-shot) | 64.59 | 64.42 |
| GSM8K (Strict-Match, 5-shot) | 82.71 | 82.64 |
| HellaSwag (Acc-Norm, 10-shot) | 83.80 | 83.77 |
| MMLU (Acc, 5-shot) | 81.12 | 80.98 |
| TruthfulQA (MC2, 0-shot) | 58.41 | 58.30 |
| Winogrande (Acc, 5-shot) | 76.40 | 76.09 |
| **Average Score** | **74.51** | **74.36** |
| **Recovery (%)** | **100.00** | **99.79** |

#### OpenLLM Leaderboard V2 evaluation scores

| Metric | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-Dynamic |
|---------------------------------------------------|:----------------------------------------:|:----------------------------------------------------:|
| IFEval (Inst-and-Prompt Level Strict Acc, 0-shot) | 42.87 | 42.26 |
| BBH (Acc-Norm, 3-shot) | 57.96 | 58.38 |
| GPQA (Acc-Norm, 0-shot) | 26.95 | 26.86 |
| MUSR (Acc-Norm, 0-shot) | 43.95 | 44.22 |
| MMLU-Pro (Acc, 5-shot) | 49.82 | 49.43 |
| **Average Score** | **44.31** | **44.23** |
| **Recovery (%)** | **100.00** | **99.82** |

#### Coding evaluation scores

| Metric | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | neuralmagic/DeepSeek-R1-Distill-Qwen-32B-FP8-Dynamic |
|--------------------|:----------------------------------------:|:----------------------------------------------------:|
| HumanEval pass@1 | 86.00 | 85.20 |
| HumanEval pass@10 | 92.50 | 92.20 |
| HumanEval+ pass@1 | 82.00 | 80.90 |
| HumanEval+ pass@10 | 88.70 | 88.70 |
| **Average Score** | **87.30** | **86.75** |
| **Recovery (%)** | **100.00** | **99.37** |