Intel/Phi-3-mini-128k-instruct-int4-inc-recipe

English

isaacmac committed 17 days ago

Commit 7b2e446 (1 parent: 30d209a)

Update README.md

README.md +55 -19

@@ -9,7 +9,7 @@ language:
 ## Model Details
-This model is an int4 model with group_size 128 of [microsoft/Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)  generated by [intel/auto-round](https://github.com/intel/auto-round).
 Inference of this model is compatible with AutoGPTQ's Kernel.
@@ -18,16 +18,14 @@ Inference of this model is compatible with AutoGPTQ's Kernel.
-### Reproduce the model
 Here is the sample command to reproduce the model
 ```bash
-git clone https://github.com/intel/auto-round
-cd auto-round/examples/language-modeling
-pip install -r requirements.txt
-python3 main.py \
---model_name  microsoft/Phi-3-mini-128k-instruct \
 --device 0 \
 --group_size 128 \
 --bits 4 \
@@ -35,9 +33,9 @@ python3 main.py \
 --nsamples 512 \
 --seqlen 4096 \
 --minmax_lr 0.01 \
---deployment_device 'gpu' \
 --gradient_accumulate_steps 2 \
---train_bs 4 \
 --output_dir "./tmp_autoround" \
 ```
@@ -46,15 +44,59 @@ python3 main.py \
-### Evaluate the model
-Install [lm-eval-harness 0.4.2](https://github.com/EleutherAI/lm-evaluation-harness.git) from source.
 ```bash
-lm_eval --model hf --model_args pretrained="Intel/Phi-3-mini-128k-instruct-int4-inc",autogptq=True,gptq_use_triton=True --device cuda:0 --tasks lambada_openai,hellaswag,piqa,winogrande,truthfulqa_mc1,openbookqa,boolq,arc_easy,arc_challenge,mmlu --batch_size 32
 ```
-| Metric         | FP16   | INT4   |
 | -------------- | ------ | ------ |
 | Avg.           | 0.6364 | 0.6300 |
 | mmlu           | 0.6215 | 0.6237 |
@@ -68,11 +110,6 @@ lm_eval --model hf --model_args pretrained="Intel/Phi-3-mini-128k-instruct-int4-
 | arc_easy       | 0.8119 | 0.8199 |
 | arc_challenge  | 0.5418 | 0.5350 |
 ## Caveats and Recommendations
 Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
@@ -80,7 +117,6 @@ Users (both direct and downstream) should be made aware of the risks, biases and
 Here are a couple of useful links to learn more about Intel's AI software:
 * Intel Neural Compressor [link](https://github.com/intel/neural-compressor)
-* Intel Extension for Transformers [link](https://github.com/intel/intel-extension-for-transformers)

 ## Model Details
+This is an int4 model recipe, with group_size 128, for [microsoft/Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct), generated by [intel/auto-round](https://github.com/intel/auto-round).
 Inference of this model is compatible with AutoGPTQ's Kernel.
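
To make "int4 with group_size 128" concrete, here is a minimal sketch of group-wise 4-bit quantization. It uses plain round-to-nearest with a per-group scale and minimum, which is an assumption chosen for illustration; it is not the tuning procedure auto-round applies, and the toy weights and helper names (`quantize_group`, `quantize_row`) are hypothetical.

```python
import random

GROUP_SIZE = 128      # matches group_size 128 in this card
BITS = 4              # matches int4
QMAX = 2**BITS - 1    # 15, the largest unsigned 4-bit code


def quantize_group(weights):
    """Round-to-nearest asymmetric quantization of one group of floats."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / QMAX or 1.0  # fall back to 1.0 for a constant group
    codes = [min(QMAX, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map 4-bit codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize_row(row, group_size=GROUP_SIZE):
    """Split a weight row into groups and quantize each one independently."""
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]  # toy weights, not from the model
groups = quantize_row(row)
restored = [w for g in groups for w in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups={len(groups)} max_abs_error={max_err:.6f}")
```

Each group of 128 weights carries its own scale and offset, so the rounding error inside a group is bounded by half of that group's scale.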
+### Quantize the model
 Here is the sample command to reproduce the model
 ```bash
+pip install auto-round
+auto-round \
+--model  microsoft/Phi-3-mini-128k-instruct \
 --device 0 \
 --group_size 128 \
 --bits 4 \
 --nsamples 512 \
 --seqlen 4096 \
 --minmax_lr 0.01 \
+--format 'auto_gptq' \
 --gradient_accumulate_steps 2 \
+--batch_size 4 \
 --output_dir "./tmp_autoround"
 ```
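
For scripted runs, the same invocation can be assembled as an argument list. This is a sketch only: the flags are copied from the sample command above, the diff view may hide unchanged lines so the set is possibly incomplete, and `build_command` is a hypothetical helper, not part of auto-round.

```python
import shlex

# Flags copied from the sample command above. The diff view may hide unchanged
# lines, so treat this as possibly incomplete rather than the full recipe.
FLAGS = {
    "model": "microsoft/Phi-3-mini-128k-instruct",
    "device": "0",
    "group_size": "128",
    "bits": "4",
    "nsamples": "512",
    "seqlen": "4096",
    "minmax_lr": "0.01",
    "format": "auto_gptq",
    "gradient_accumulate_steps": "2",
    "batch_size": "4",
    "output_dir": "./tmp_autoround",
}


def build_command(flags, executable="auto-round"):
    """Return the command as an argument list suitable for subprocess.run."""
    argv = [executable]
    for name, value in flags.items():
        argv += [f"--{name}", str(value)]
    return argv


argv = build_command(FLAGS)
print(shlex.join(argv))  # shell-quoted form, useful for logging

# To launch it (requires `pip install auto-round`):
# import subprocess
# subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting and line-continuation mistakes.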
+## How to use
+### INT4 Inference with IPEX on Intel CPU
+Install the latest [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) and [Intel Neural Compressor](https://github.com/intel/neural-compressor).
 ```bash
+pip install torch --index-url https://download.pytorch.org/whl/cpu
+pip install intel_extension_for_pytorch
+pip install neural_compressor_pt
 ```
+```python
+from transformers import AutoTokenizer
+from neural_compressor.transformers import AutoModelForCausalLM
+## note: use quantized model directory name below
+model_name_or_path="./tmp_autoround/<model directory name>"
+q_model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
+prompt = "Once upon a time, a little girl"
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
+print(tokenizer.decode(q_model.generate(**tokenizer(prompt, return_tensors="pt").to(q_model.device),max_new_tokens=50)[0]))
+##Once upon a time, a little girl named Lily was playing in her backyard. She loved to explore and discover new things. One day, she stumbled upon a beautiful garden filled with colorful flowers andugh the garden, she noticed a
+```
+### INT4 Inference on Intel Gaudi Accelerator
+A Docker image with the Gaudi Software Stack is recommended. More details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/).
+```python
+import habana_frameworks.torch.core as htcore
+from neural_compressor.torch.quantization import load
+from transformers import  AutoTokenizer, AutoModelForCausalLM
+## note: use quantized model directory name below
+model_name_or_path="./tmp_autoround/<model directory name>"
+model = load(
+    model_name_or_path=model_name_or_path,
+    format="huggingface",
+    device="hpu"
+)
+prompt = "Once upon a time, a little girl"
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+print(tokenizer.decode(model.generate(**tokenizer(prompt, return_tensors="pt").to("hpu"),max_new_tokens=50)[0]))
+```
+## Accuracy Result
+ | Metric   <img width=200>     | FP16  <img width=200>   | INT4  <img width=200>   |
 | -------------- | ------ | ------ |
 | Avg.           | 0.6364 | 0.6300 |
 | mmlu           | 0.6215 | 0.6237 |
 | arc_easy       | 0.8119 | 0.8199 |
 | arc_challenge  | 0.5418 | 0.5350 |
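
To read the table as a quantization cost, the sketch below computes the INT4 minus FP16 delta and the share of the FP16 score that INT4 retains. It covers only the rows visible in this diff (other task rows are not shown here), and the `compare` helper is hypothetical.

```python
# Scores copied from the rows visible in the table above (FP16, INT4).
scores = {
    "Avg.": (0.6364, 0.6300),
    "mmlu": (0.6215, 0.6237),
    "arc_easy": (0.8119, 0.8199),
    "arc_challenge": (0.5418, 0.5350),
}


def compare(fp16, int4):
    """Return (absolute delta, percent of the FP16 score retained by INT4)."""
    return int4 - fp16, 100.0 * int4 / fp16


for name, (fp16, int4) in scores.items():
    delta, retained = compare(fp16, int4)
    print(f"{name:<14} delta={delta:+.4f} retained={retained:.2f}%")
```

On the average row, INT4 keeps roughly 99% of the FP16 score.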
 ## Caveats and Recommendations
 Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
 Here are a couple of useful links to learn more about Intel's AI software:
 * Intel Neural Compressor [link](https://github.com/intel/neural-compressor)