kvaishnavi commited on
Commit
d3da9b4
·
verified ·
1 Parent(s): d05c4c9

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +49 -30
README.md CHANGED
@@ -9,10 +9,11 @@ inference: false
9
  This repository hosts the optimized versions of [Phi-3.5-mini-4k-instruct](https://aka.ms/phi3.5-mini-4k-instruct) to accelerate inference with ONNX Runtime.
10
  Optimized Phi-3.5 Mini models are published here in [ONNX](https://onnx.ai) format to run with [ONNX Runtime](https://onnxruntime.ai/) on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these targets.
11
 
12
- To easily get started with Phi-3.5, you can use our newly introduced ONNX Runtime Generate() API. See See [here](https://aka.ms/generate-tutorial) for instructions on how to run it.
13
 
14
  ## ONNX Models
15
  Here are some of the optimized configurations we have added:
 
16
  1. ONNX model for fp16 CUDA: ONNX model you can use to run for your NVIDIA GPUs.
17
  2. ONNX model for int4 CUDA: ONNX model for NVIDIA GPUs using int4 quantization via AWQ.
18
  3. ONNX model for int4 CPU and Mobile: ONNX model for CPU and mobile using int4 quantization via AWQ.
@@ -21,31 +22,38 @@ Here are some of the optimized configurations we have added:
21
  Phi-3.5-mini is a lightweight, state-of-the-art open model built upon datasets used for Phi-3 - synthetic data and filtered publicly available websites - with a focus on very high-quality, reasoning dense data. The model belongs to the Phi-3 model family and supports 128K token context length. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning, proximal policy optimization, and direct preference optimization to ensure precise instruction adherence and robust safety measures.
22
 
23
  ## Intended Uses
24
- Primary Use Cases
25
  The Phi 3.5 model is intended for commercial and research use in multiple languages. The model provides uses for general purpose AI systems and applications which require:
 
26
  1. Memory/compute constrained environments
27
  2. Latency bound scenarios
28
  3. Strong reasoning (especially code, math and logic)
 
29
  ## Use Case Considerations
30
  Phi 3.5 models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fariness before using within a specific downstream use case, particularly for high risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.
31
  Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
32
- Release Notes
33
- This is an update over the June 2024 instruction-tuned Phi-3 Mini ONNX release based. We believe most use cases will benefit from this release, but we encourage users to test their particular AI applications. We appreciate the enthusiastic adoption of the Phi-3 model family and continue to welcome all feedback from the community.
 
 
34
  ## Hardware Supported
35
  The ONNX models are tested on:
 
36
  - GPU SKU: RTX 4090 (DirectML)
37
  - GPU SKU: 1 A100 80GB GPU, SKU: Standard_ND96amsr_A100_v4 (CUDA)
38
- - CPU SKU: Standard D16s v6 (16 vcpus, 64 GiB memory) , AMD CPU: Internal_D64as_v5
 
 
 
 
 
39
 
40
- ## Minimum Configuration Required:
41
- • Windows: DirectX 12-capable GPU and a minimum of 4GB of combined RAM
42
- • CUDA: NVIDIA GPU with Compute Capability >= 7.0
43
  ## Model Description
44
- • Developed by: Microsoft
45
- • Model type: ONNX
46
- • Language(s) (NLP): Python, C, C++
47
- • License: MIT
48
- • Model Description: This is a conversion of the Phi-3.5 Mini-Instruct model for ONNX Runtime inference.
 
49
 
50
  ## How to Get Started with the Model
51
  To make running of the Phi-3 models across a range of devices and platforms across various execution provider backends possible, we introduce a new API to wrap several aspects of generative AI inferencing. This API make it easy to drag and drop LLMs straight into your app. For running the early version of these models with ONNX Runtime, follow the steps [here](http://aka.ms/generate-tutorial).
@@ -53,7 +61,7 @@ To make running of the Phi-3 models across a range of devices and platforms acro
53
  For example:
54
 
55
  ```python
56
- python model-qa.py -m /*{YourModelPath}*/onnx/cpu_and_mobile/phi-3-mini-4k-instruct-int4-cpu -k 40 -p 0.95 -t 0.8 -r 1.0
57
  ```
58
 
59
  ```
@@ -66,10 +74,10 @@ This joke plays on the double meaning of "make up." In science, atoms are the fu
66
  ```
67
 
68
  ## Performance Metrics
69
- CUDA
70
  Phi-3.5 Mini-Instruct performs better in ONNX Runtime than PyTorch for all batch size, prompt length combinations.
71
- The table below shows the average throughput of the first 256 tokens generated (tps) for FP16 and INT4 precisions on CUDA as measured on 1 A100 80GB GPU, SKU: Standard_ND96amsr_A100_v4. ONNX Runtime Models for GPU are 21X faster than PyTorch Compile and up to 8X faster than llama.cpp on A100 GPU.
72
- Average Throughput of First 256 Tokens Generated (tps)
73
 
74
  | Batch Size, Sequence Length | ONNX RT INT4 | PyTorch Eager INT4 | PyTorch Compile INT4 | Llama.cpp INT4 | INT4 SpeedUp ORT/PyTorch Eager | INT4 SpeedUp ORT/PyTorch Compile | INT4 SpeedUp ORT/Llama.cpp |
75
  | --- | --- | --- | --- | --- | --- | --- | --- |
@@ -97,7 +105,8 @@ The table below shows the average throughput of the first 256 tokens generated (
97
  | 16,1024 | 536.73 | 209.13 | 169.30 | 71.57 | 2.57 | 3.17 | 7.50 |
98
  | 16,2048 | 375.31 | 153.95 | 158.77 | 45.97 | 2.44 | 2.36 | 8.16 |
99
  | 16,3840 | 243.66 | OOM | OOM | 28.33 | | | 8.60 |
100
-
 
101
  | Batch Size, Sequence Length | ONNX RT FP16 | PyTorch Eager FP16 | PyTorch Compile FP16 | Llama.cpp | FP16 SpeedUp ORT/PyTorch Eager | FP16 SpeedUp ORT/PyTorch Compile | FP16 SpeedUp ORT/Llama.cpp |
102
  | --- | --- | --- | --- | --- | --- | --- | --- |
103
  | 1,16 | 137.30 | 26.02 | 26.83 | 125.86 | 5.28 | 5.12 | 1.09 |
@@ -125,8 +134,9 @@ The table below shows the average throughput of the first 256 tokens generated (
125
  | 16,2048 | 441.15 | 121.17 | 162.93 | 41.30 | 3.64 | 2.71 | 10.68 |
126
  | 16,3840 | 270.38 | OOM | OOM | 26.50 | 0.00 | 0.00 | 10.20 |
127
 
 
128
  The table below shows the average throughput of the first 256 tokens generated (tps) for INT4 precision on CPU as measured on a Standard D16s v6 (16 vcpus, 64 GiB memory)
129
- Average Throughput of First 256 Tokens Generated (tps)
130
  | Batch Size, Sequence Length | ORT INT4 AWQ | Llama.cpp INT4 | INT4 AWQ SpeedUp Llama.cpp |
131
  | --- | --- | --- | --- |
132
  | 1,16 | 41.99 | 26.72 | 1.57 |
@@ -134,24 +144,33 @@ The table below shows the average throughput of the first 256 tokens generated (
134
  | 1,256 | 41.26 | 26.30 | 1.57 |
135
  | 1,1024 | 37.15 | 24.02 | 1.55 |
136
  | 1,2048 | 32.68 | 21.82 | 1.50 |
137
-
 
138
  ## Package Versions
139
- Pip packages torch 2.4.1
140
- triton 3.0.0
141
- onnxruntime 1.19.2
142
- transformers 4.44.2
143
- llama.cpp bdf314f38a2c90e18285f7d7067e8d736a14000a
 
 
 
 
 
 
144
 
145
  ## Appendix
146
- Activation Aware Quantization" AWQ works by identifying the top 1% most salient weights that are most important for maintaining accuracy and quantizing the remaining 99% of weights. This leads to less accuracy loss from quantization compared to many other quantization techniques. For more on AWQ, see here.
 
147
 
148
  ## Model Card Contact
149
  parinitarahi
 
150
  ## Contributors
151
- Sunghoon Choi, Yufeng Li, Kunal Vaishnavi, Akshay Sonawane, Rui Ren, Parinita Rahi
 
152
  ## License
153
  The model is licensed under the MIT license.
154
- ## Trademarks
155
- This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.
156
-
157
 
 
 
 
9
  This repository hosts the optimized versions of [Phi-3.5-mini-4k-instruct](https://aka.ms/phi3.5-mini-4k-instruct) to accelerate inference with ONNX Runtime.
10
  Optimized Phi-3.5 Mini models are published here in [ONNX](https://onnx.ai) format to run with [ONNX Runtime](https://onnxruntime.ai/) on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these targets.
11
 
12
+ To easily get started with Phi-3.5, you can use our newly introduced ONNX Runtime Generate() API. See [here](https://aka.ms/generate-tutorial) for instructions on how to run it.
13
 
14
  ## ONNX Models
15
  Here are some of the optimized configurations we have added:
16
+
17
  1. ONNX model for fp16 CUDA: ONNX model you can use to run for your NVIDIA GPUs.
18
  2. ONNX model for int4 CUDA: ONNX model for NVIDIA GPUs using int4 quantization via AWQ.
19
  3. ONNX model for int4 CPU and Mobile: ONNX model for CPU and mobile using int4 quantization via AWQ.
 
22
  Phi-3.5-mini is a lightweight, state-of-the-art open model built upon datasets used for Phi-3 - synthetic data and filtered publicly available websites - with a focus on very high-quality, reasoning dense data. The model belongs to the Phi-3 model family and supports 128K token context length. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning, proximal policy optimization, and direct preference optimization to ensure precise instruction adherence and robust safety measures.
23
 
24
  ## Intended Uses
 
25
  The Phi 3.5 model is intended for commercial and research use in multiple languages. The model provides uses for general purpose AI systems and applications which require:
26
+
27
  1. Memory/compute constrained environments
28
  2. Latency bound scenarios
29
  3. Strong reasoning (especially code, math and logic)
30
+
31
  ## Use Case Considerations
32
  Phi 3.5 models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fariness before using within a specific downstream use case, particularly for high risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.
33
  Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
34
+
35
+ ## Release Notes
36
+ This is an update over the instruction-tuned Phi-3 Mini ONNX model release. We believe most use cases will benefit from this release, but we encourage users to test their particular AI applications. We appreciate the enthusiastic adoption of the Phi-3 model family and continue to welcome all feedback from the community.
37
+
38
  ## Hardware Supported
39
  The ONNX models are tested on:
40
+
41
  - GPU SKU: RTX 4090 (DirectML)
42
  - GPU SKU: 1 A100 80GB GPU, SKU: Standard_ND96amsr_A100_v4 (CUDA)
43
+ - CPU SKU: Standard D16s v6 (16 vcpus, 64 GiB memory)
44
+ - AMD CPU: Internal_D64as_v5
45
+
46
+ Minimum Configuration Required:
47
+ - Windows: DirectX 12-capable GPU and a minimum of 4GB of combined RAM
48
+ - CUDA: NVIDIA GPU with Compute Capability >= 7.0
49
 
 
 
 
50
  ## Model Description
51
+
52
+ - **Developed by:** Microsoft
53
+ - **Model type:** ONNX
54
+ - **Language(s) (NLP):** Python, C, C++
55
+ - **License:** MIT
56
+ - **Model Description:** This is a conversion of the Phi-3.5 Mini-Instruct model for ONNX Runtime inference.
57
 
58
  ## How to Get Started with the Model
59
  To make running of the Phi-3 models across a range of devices and platforms across various execution provider backends possible, we introduce a new API to wrap several aspects of generative AI inferencing. This API make it easy to drag and drop LLMs straight into your app. For running the early version of these models with ONNX Runtime, follow the steps [here](http://aka.ms/generate-tutorial).
 
61
  For example:
62
 
63
  ```python
64
+ python model-qa.py -m /*{YourModelPath}*/Phi-3.5-mini-instruct-onnx/cpu_and_mobile/cpu-int4-awq-block-128-acc-level-4 -k 40 -p 0.95 -t 0.8 -r 1.0
65
  ```
66
 
67
  ```
 
74
  ```
75
 
76
  ## Performance Metrics
77
+
78
  Phi-3.5 Mini-Instruct performs better in ONNX Runtime than PyTorch for all batch size, prompt length combinations.
79
+
80
+ The table below shows the average throughput of the first 256 tokens generated (tps) for FP16 and INT4 precisions on CUDA as measured on 1 A100 80GB GPU, SKU: Standard_ND96amsr_A100_v4. ONNX Runtime models for GPU are 21X faster than PyTorch Compile and up to 8X faster than llama.cpp on A100 GPU.
81
 
82
  | Batch Size, Sequence Length | ONNX RT INT4 | PyTorch Eager INT4 | PyTorch Compile INT4 | Llama.cpp INT4 | INT4 SpeedUp ORT/PyTorch Eager | INT4 SpeedUp ORT/PyTorch Compile | INT4 SpeedUp ORT/Llama.cpp |
83
  | --- | --- | --- | --- | --- | --- | --- | --- |
 
105
  | 16,1024 | 536.73 | 209.13 | 169.30 | 71.57 | 2.57 | 3.17 | 7.50 |
106
  | 16,2048 | 375.31 | 153.95 | 158.77 | 45.97 | 2.44 | 2.36 | 8.16 |
107
  | 16,3840 | 243.66 | OOM | OOM | 28.33 | | | 8.60 |
108
+
109
+
110
  | Batch Size, Sequence Length | ONNX RT FP16 | PyTorch Eager FP16 | PyTorch Compile FP16 | Llama.cpp | FP16 SpeedUp ORT/PyTorch Eager | FP16 SpeedUp ORT/PyTorch Compile | FP16 SpeedUp ORT/Llama.cpp |
111
  | --- | --- | --- | --- | --- | --- | --- | --- |
112
  | 1,16 | 137.30 | 26.02 | 26.83 | 125.86 | 5.28 | 5.12 | 1.09 |
 
134
  | 16,2048 | 441.15 | 121.17 | 162.93 | 41.30 | 3.64 | 2.71 | 10.68 |
135
  | 16,3840 | 270.38 | OOM | OOM | 26.50 | 0.00 | 0.00 | 10.20 |
136
 
137
+
138
  The table below shows the average throughput of the first 256 tokens generated (tps) for INT4 precision on CPU as measured on a Standard D16s v6 (16 vcpus, 64 GiB memory)
139
+
140
  | Batch Size, Sequence Length | ORT INT4 AWQ | Llama.cpp INT4 | INT4 AWQ SpeedUp Llama.cpp |
141
  | --- | --- | --- | --- |
142
  | 1,16 | 41.99 | 26.72 | 1.57 |
 
144
  | 1,256 | 41.26 | 26.30 | 1.57 |
145
  | 1,1024 | 37.15 | 24.02 | 1.55 |
146
  | 1,2048 | 32.68 | 21.82 | 1.50 |
147
+
148
+
149
  ## Package Versions
150
+
151
+ | Pip package name | Version |
152
+ |----------------------------|----------|
153
+ | torch | 2.4.1 |
154
+ | triton | 3.0.0 |
155
+ | onnxruntime-gpu | 1.19.2 |
156
+ | onnxruntime-genai | 0.4.0 |
157
+ | onnxruntime-genai-cuda | 0.4.0 |
158
+ | transformers | 4.44.2 |
159
+ | llama.cpp | bdf314f38a2c90e18285f7d7067e8d736a14000a |
160
+
161
 
162
  ## Appendix
163
+
164
+ Activation Aware Quantization (AWQ) works by identifying the top 1% most salient weights that are most important for maintaining accuracy and quantizing the remaining 99% of weights. This leads to less accuracy loss from quantization compared to many other quantization techniques. For more on AWQ, see [here](https://arxiv.org/abs/2306.00978).
165
 
166
  ## Model Card Contact
167
  parinitarahi
168
+
169
  ## Contributors
170
+ Sunghoon Choi, Yufeng Li, Kunal Vaishnavi, Akshay Sonawane, Rui Ren, Parinita Rahi
171
+
172
  ## License
173
  The model is licensed under the MIT license.
 
 
 
174
 
175
+ ## Trademarks
176
+ This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.