---
license: mit
language:
- en
tags:
- AWQ
- phi3
---

# Phi 3 mini 4k instruct - AWQ

- Model creator: [Microsoft](https://huggingface.co/microsoft)
- Original model: [Phi 3 mini 4k Instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)

<!-- description start -->
## Description

This repo contains AWQ model files for [Microsoft's Phi 3 mini 4k Instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct).

### About AWQ

AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.
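
For instance, recent versions of Transformers can load this checkpoint directly, picking up the AWQ settings stored in the model's `config.json` (as AWQ repos produced by AutoAWQ typically carry). A minimal sketch, assuming the `autoawq` package is installed and a CUDA GPU is available; `trust_remote_code=True` may be unnecessary on newer Transformers releases:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Sreenington/Phi-3-mini-4k-instruct-AWQ"

# Transformers reads the AWQ quantization config stored in the checkpoint;
# this requires the `autoawq` package and a CUDA GPU.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
)

# Phi-3 chat format (see "Prompt Format" below).
prompt = "<|user|>\nWhat is AWQ quantization?<|end|>\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```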

It is also supported by the continuous-batching server [vLLM](https://github.com/vllm-project/vllm), which allows AWQ models to be used for high-throughput concurrent inference in multi-user server scenarios. Note that, at the time of writing, overall throughput is still lower than running vLLM with unquantised models; however, AWQ allows the use of much smaller GPUs, which can make deployment easier and cheaper. For example, a 70B model can be run on 1 x 48GB GPU instead of 2 x 80GB.
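
In such a deployment, a vLLM server started with this model exposes an OpenAI-compatible API. A hypothetical client-side sketch, assuming a local server was launched with `--quantization awq` (the address and key are placeholders):

```python
from openai import OpenAI

# Hypothetical local vLLM server exposing the OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="Sreenington/Phi-3-mini-4k-instruct-AWQ",
    prompt="<|user|>\nWhat is a castle moat for?<|end|>\n<|assistant|>\n",
    max_tokens=128,
)
print(response.choices[0].text)
```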

## Model Details

Phi-3-Mini-4K-Instruct is a lightweight, state-of-the-art open model with 3.8B parameters, trained on the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality, reasoning-dense properties.
The model belongs to the Phi-3 family; the Mini version comes in two variants, [4K](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) and [128K](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct), which denote the context length (in tokens) each can support.

The model underwent a post-training process that incorporates both supervised fine-tuning and direct preference optimization for instruction following and safety measures.
When assessed against benchmarks testing common sense, language understanding, math, code, long context and logical reasoning, Phi-3 Mini-4K-Instruct showcased robust, state-of-the-art performance among models with fewer than 13 billion parameters.

Resources and Technical Documentation:

+ [Phi-3 Microsoft Blog](https://aka.ms/phi3blog-april)
+ [Phi-3 Technical Report](https://aka.ms/phi3-tech-report)
+ [Phi-3 on Azure AI Studio](https://aka.ms/phi3-azure-ai)
+ Phi-3 GGUF: [4K](https://aka.ms/Phi3-mini-4k-instruct-gguf)
+ Phi-3 ONNX: [4K](https://aka.ms/Phi3-mini-4k-instruct-onnx)

## Prompt Format

<pre>
<|user|>
How to explain the Internet for a medieval knight?<|end|>
<|assistant|>
</pre>
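
Rather than assembling this string by hand, you can let the tokenizer's chat template produce it. A small sketch, assuming the tokenizer in this repo ships Phi-3's chat template:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Sreenington/Phi-3-mini-4k-instruct-AWQ")

messages = [
    {"role": "user", "content": "How to explain the Internet for a medieval knight?"},
]

# add_generation_prompt=True appends the trailing <|assistant|> tag
# so the model knows to start its reply.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```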

## How to use

### Using vLLM

```python
from vllm import LLM, SamplingParams

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=128)

# Create an LLM with AWQ quantization enabled.
llm = LLM(model="Sreenington/Phi-3-mini-4k-instruct-AWQ", quantization="AWQ")

# Prompt in the Phi-3 chat format shown above.
prompt = """
<|user|>
How to explain the Internet for a medieval knight?<|end|>
<|assistant|>
"""

outputs = llm.generate(prompt, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n Generated text:\n {generated_text!r}")
```
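
vLLM also accepts a list of prompts in a single `generate` call and batches them internally, which is where the continuous-batching throughput mentioned above pays off. A brief sketch, reusing the `llm` and `sampling_params` objects from the example above (the second prompt is illustrative):

```python
# Reuses `llm` and `sampling_params` from the previous example.
prompts = [
    "<|user|>\nHow to explain the Internet for a medieval knight?<|end|>\n<|assistant|>\n",
    "<|user|>\nSummarise AWQ quantization in one sentence.<|end|>\n<|assistant|>\n",
]

# One output object is returned per prompt, in order.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```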