---
license: apache-2.0
datasets:
- librispeech_asr
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- ONNX
- Intel® Neural Compressor
- neural-compressor
library_name: transformers
---
## INT4 Whisper small ONNX Model
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. This repository hosts the Whisper small model in ONNX format with INT4 weight-only quantization, powered by [Intel® Neural Compressor](https://github.com/intel/neural-compressor) and [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).
The INT4 ONNX model is generated with the weight-only quantization method of [Intel® Neural Compressor](https://github.com/intel/neural-compressor).
| Model Detail | Description |
| ----------- | ----------- |
| Model Authors - Company | Intel |
| Date | October 8, 2023 |
| Version | 1 |
| Type | Speech Recognition |
| Paper or Other Resources | - |
| License | Apache 2.0 |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/whisper-small-onnx-int4/discussions)|

| Intended Use | Description |
| ----------- | ----------- |
| Primary intended uses | You can use the raw model for automatic speech recognition inference |
| Primary intended users | Anyone doing automatic speech recognition inference |
| Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people.|
### Export to ONNX Model
The FP32 ONNX model is exported from `openai/whisper-small` with `optimum-cli`:
```shell
optimum-cli export onnx --model openai/whisper-small whisper-small-with-past/ --task automatic-speech-recognition-with-past --opset 13
```
### Install ONNX Runtime
Install `onnxruntime>=1.16.0`, which is required for the [`MatMulFpQ4`](https://github.com/microsoft/onnxruntime/blob/v1.16.0/docs/ContribOperators.md#com.microsoft.MatMulFpQ4) operator.
### Run Quantization
Run INT4 weight-only quantization with [Intel® Neural Compressor](https://github.com/intel/neural-compressor/tree/master).
The weight-only quantization configuration is as follows:
| dtype | group_size | scheme | algorithm |
| :----- | :---------- | :------ | :--------- |
| INT4 | 32 | sym | RTN |

We provide the key code below. For the complete script, please refer to [whisper example](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/onnxruntime/speech-recognition/quantization).
```python
import os

from neural_compressor import quantization, PostTrainingQuantConfig
from neural_compressor.utils.constant import FP32

model_list = ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']
for model in model_list:
    # INT4, symmetric, round-to-nearest (RTN), with group size 32
    config = PostTrainingQuantConfig(
        approach="weight_only",
        calibration_sampling_size=[8],
        op_type_dict={".*": {"weight": {"bits": 4,
                                        "algorithm": ["RTN"],
                                        "scheme": ["sym"],
                                        "group_size": 32}}},)
    q_model = quantization.fit(
        os.path.join("/path/to/whisper-small-with-past", model),  # FP32 model path
        config,
        calib_dataloader=dataloader)  # `dataloader` is not defined in this excerpt; see the complete script
    q_model.save(os.path.join("/path/to/whisper-small-onnx-int4", model))  # INT4 model path
```
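To give an intuition for what the configuration above (INT4, `group_size` 32, symmetric, RTN) means numerically, here is a minimal NumPy sketch of group-wise symmetric round-to-nearest quantization. It is an illustration only, not the Intel® Neural Compressor implementation: the function names `rtn_quantize_sym` and `dequantize`, the choice to group along the first weight axis, and the scale formula `max_abs / 7` are all assumptions made for this example.

```python
import numpy as np


def rtn_quantize_sym(weight, bits=4, group_size=32):
    """Quantize a 2-D weight matrix with one scale per group of rows.

    Assumption for this sketch: groups are formed along axis 0 and the
    scale maps the largest magnitude in each group to the value 7.
    """
    rows, cols = weight.shape
    assert rows % group_size == 0, "rows must be a multiple of group_size"
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    groups = weight.reshape(rows // group_size, group_size, cols)
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    # Avoid division by zero for all-zero groups.
    scale = np.where(max_abs == 0, 1.0, max_abs / qmax)
    # Round to nearest integer, then clamp to the signed INT4 range [-8, 7].
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale


def dequantize(q, scale, shape):
    """Recover an approximate floating-point matrix from integers and scales."""
    return (q.astype(np.float32) * scale).reshape(shape)


# Small demonstration on random data (hypothetical weights, not Whisper's).
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 8)).astype(np.float32)
q, scale = rtn_quantize_sym(w)
w_hat = dequantize(q, scale, w.shape)
print("max abs reconstruction error:", float(np.max(np.abs(w - w_hat))))
```

RTN needs no calibration data to pick the integers: each weight is simply rounded to the nearest representable level, so the per-element error is bounded by half of its group's scale. A smaller group size gives each scale fewer weights to cover, which tends to lower the error at the cost of storing more scales.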
### Evaluation
**Operator Statistics**
The table below shows the operator statistics of the INT4 ONNX model:
|Model| Op Type | Total | INT4 weight | FP32 weight |
|:-------:|:-------:|:-------:|:-------:|:-------:|
|encoder_model| MatMul | 96 | 72 | 24 |
|decoder_model| MatMul | 169 | 121 | 48 |
|decoder_with_past_model| MatMul | 145 | 97 | 48 |
**Evaluation of WER**
Evaluate the model on the `librispeech_asr` dataset with the code below:
```python
import os

from datasets import load_dataset
from evaluate import load
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import PretrainedConfig, WhisperProcessor

model_name = 'openai/whisper-small'
model_path = 'whisper-small-onnx-int4'

# The processor and config come from the original FP32 checkpoint
processor = WhisperProcessor.from_pretrained(model_name)
model_config = PretrainedConfig.from_pretrained(model_name)
wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

# Load the three INT4 ONNX graphs and wrap them in a single seq2seq model
sessions = ORTModelForSpeechSeq2Seq.load_model(
    os.path.join(model_path, 'encoder_model.onnx'),
    os.path.join(model_path, 'decoder_model.onnx'),
    os.path.join(model_path, 'decoder_with_past_model.onnx'))
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])

predictions = []
references = []
for idx, batch in enumerate(librispeech_test_clean):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    # Normalize both reference and prediction so formatting differences do not count as errors
    reference = processor.tokenizer._normalize(batch['text'])
    references.append(reference)
    predicted_ids = model.generate(input_features)[0]
    transcription = processor.decode(predicted_ids)
    prediction = processor.tokenizer._normalize(transcription)
    predictions.append(prediction)

wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
```
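For readers unfamiliar with the metric, word error rate is conventionally the number of word-level substitutions, deletions and insertions divided by the number of reference words. The sketch below computes it with a plain edit-distance recurrence over the whole corpus. It is meant only to illustrate the definition: `word_error_rate` is a hypothetical helper written for this example, not the code behind the `wer` metric loaded above.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        hyp_words = hyp.split()
        # prev[j] holds the edit distance between the reference words seen
        # so far and the first j hypothesis words.
        prev = list(range(len(hyp_words) + 1))
        for i, r in enumerate(ref_words, start=1):
            curr = [i]
            for j, h in enumerate(hyp_words, start=1):
                substitution = prev[j - 1] + (0 if r == h else 1)
                deletion = prev[j] + 1
                insertion = curr[j - 1] + 1
                curr.append(min(substitution, deletion, insertion))
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
example = word_error_rate(["the cat sat on the mat"], ["the cat sit on mat"])
print(f"WER: {example * 100:.2f}%")
```

Because the score counts exact word mismatches, text normalization applied to both references and predictions (as in the evaluation script) has a direct effect on the reported number.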
## Metrics (Model Performance):
| Model | Model Size (GB) | WER (%) |
|---|:---:|:---:|
| FP32 |1.42|3.45|
| INT4 |0.53|3.57|
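
To put the two rows in relation, the following short calculation derives the compression ratio and the accuracy cost from the table values. It assumes the WER column is in percent, which matches the `wer_result * 100` printed by the evaluation script; the variable names are chosen for this example only.

```python
# Values copied from the metrics table above.
fp32_size_gb, int4_size_gb = 1.42, 0.53
fp32_wer, int4_wer = 3.45, 3.57  # assumed to be percentages

compression_ratio = fp32_size_gb / int4_size_gb
size_reduction_pct = (1 - int4_size_gb / fp32_size_gb) * 100
wer_delta = int4_wer - fp32_wer
relative_wer_increase_pct = wer_delta / fp32_wer * 100

print(f"Compression ratio:     {compression_ratio:.2f}x")
print(f"Size reduction:        {size_reduction_pct:.1f}%")
print(f"Absolute WER increase: {wer_delta:.2f} points")
print(f"Relative WER increase: {relative_wer_increase_pct:.1f}%")
```

In other words, the INT4 model is roughly 2.7 times smaller than the FP32 model while its WER rises by 0.12 percentage points.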