---
license: apache-2.0
datasets:
- librispeech_asr
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- ONNX
- Intel® Neural Compressor
- neural-compressor
library_name: transformers
---

## INT4 Whisper small ONNX Model

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.

This repository contains the INT4 weight-only quantized Whisper small model in ONNX format, powered by [Intel® Neural Compressor](https://github.com/intel/neural-compressor) and [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).

The INT4 ONNX model is generated with [Intel® Neural Compressor](https://github.com/intel/neural-compressor)'s weight-only quantization method.

| Model Detail | Description |
| ----------- | ----------- |
| Model Authors - Company | Intel |
| Date | October 8, 2023 |
| Version | 1 |
| Type | Speech Recognition |
| Paper or Other Resources | - |
| License | Apache 2.0 |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/whisper-small-onnx-int4/discussions)|

| Intended Use | Description |
| ----------- | ----------- |
| Primary intended uses | You can use the raw model for automatic speech recognition inference |
| Primary intended users | Anyone doing automatic speech recognition inference |
| Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people.|

### Export to ONNX Model

The FP32 ONNX model is exported from `openai/whisper-small`:

```shell
optimum-cli export onnx --model openai/whisper-small whisper-small-with-past/ --task automatic-speech-recognition-with-past --opset 13
```

### Install ONNX Runtime

Install `onnxruntime>=1.16.0` to support the [`MatMulFpQ4`](https://github.com/microsoft/onnxruntime/blob/v1.16.0/docs/ContribOperators.md#com.microsoft.MatMulFpQ4) operator.

### Run Quantization

Run INT4 weight-only quantization with [Intel® Neural Compressor](https://github.com/intel/neural-compressor/tree/master).

The weight-only quantization configuration is as follows:

| dtype | group_size | scheme | algorithm |
| :----- | :---------- | :------ | :--------- |
| INT4 | 32 | sym | RTN |

We provide the key code below. For the complete script, please refer to the [whisper example](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/onnxruntime/speech-recognition/quantization).

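
As an aside, the configuration in the table above (INT4, `group_size` 32, symmetric scheme, RTN) can be illustrated with a minimal NumPy sketch of round-to-nearest quantization. This is a generic illustration, not Intel® Neural Compressor's implementation: the helper names `rtn_quantize_sym` and `dequantize`, the choice of `max_abs / 7` as the per-group scale, and the row-major grouping of weights are all assumptions made for the example.

```python
import numpy as np

def rtn_quantize_sym(weight, bits=4, group_size=32):
    """Symmetric round-to-nearest quantization with one scale per group.

    Hypothetical helper for illustration only; the library's kernels may
    group weights along a different axis and pick the scale differently.
    """
    qmax = 2 ** (bits - 1) - 1      # 7 for INT4
    qmin = -(2 ** (bits - 1))       # -8 for INT4
    groups = weight.reshape(-1, group_size)
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    # Guard against all-zero groups to avoid dividing by zero.
    scale = np.where(max_abs == 0, 1.0, max_abs / qmax)
    q = np.clip(np.round(groups / scale), qmin, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    """Map integer codes back to floating point using the per-group scales."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)  # stand-in weight matrix

q, scale = rtn_quantize_sym(w, bits=4, group_size=32)
w_hat = dequantize(q, scale, w.shape)

print("groups:", q.shape[0], "codes per group:", q.shape[1])
print("max abs reconstruction error:", float(np.abs(w - w_hat).max()))
```

Each group of 32 weights shares one scale, so the reconstruction error of any weight is at most half of its group's scale. The actual model is quantized with the Intel® Neural Compressor script that follows.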
```python
import os

from neural_compressor import quantization, PostTrainingQuantConfig

# `dataloader` (the calibration dataloader) is not defined in this excerpt;
# see the complete script linked above.
model_list = ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']
for model in model_list:
    # INT4, symmetric, round-to-nearest, group size 32, applied to all op types
    config = PostTrainingQuantConfig(
        approach="weight_only",
        calibration_sampling_size=[8],
        op_type_dict={".*": {"weight": {"bits": 4, "algorithm": ["RTN"], "scheme": ["sym"], "group_size": 32}}},)
    q_model = quantization.fit(
        os.path.join("/path/to/whisper-small-with-past", model),  # FP32 model path
        config,
        calib_dataloader=dataloader)
    q_model.save(os.path.join("/path/to/whisper-small-onnx-int4", model))  # INT4 model path
```

### Evaluation

**Operator Statistics**

The table below shows the operator statistics of the INT4 ONNX model:

|Model| Op Type | Total | INT4 weight | FP32 weight |
|:-------:|:-------:|:-------:|:-------:|:-------:|
|encoder_model| MatMul | 96 | 72 | 24 |
|decoder_model| MatMul | 169 | 121 | 48 |
|decoder_with_past_model| MatMul | 145 | 97 | 48 |

**Evaluation of wer**

Evaluate the model on the `librispeech_asr` dataset with the code below:

```python
import os
from evaluate import load
from datasets import load_dataset
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import WhisperForConditionalGeneration, WhisperProcessor, AutoConfig
from transformers import PretrainedConfig

model_name = 'openai/whisper-small'
model_path = 'whisper-small-onnx-int4'
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

model_config = PretrainedConfig.from_pretrained(model_name)
predictions = []
references = []

# Load the three INT4 ONNX sessions: encoder, decoder, decoder with past
sessions = ORTModelForSpeechSeq2Seq.load_model(
            os.path.join(model_path, 'encoder_model.onnx'),
            os.path.join(model_path, 'decoder_model.onnx'),
            os.path.join(model_path, 'decoder_with_past_model.onnx'))
# From here on, `model` refers to the INT4 ONNX model
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])

for idx, batch in enumerate(librispeech_test_clean):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    # Normalize both reference and prediction before scoring
    reference = processor.tokenizer._normalize(batch['text'])
    references.append(reference)
    predicted_ids = model.generate(input_features)[0]
    transcription = processor.decode(predicted_ids)
    prediction = processor.tokenizer._normalize(transcription)
    predictions.append(prediction)

wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
```

## Metrics (Model Performance):

| Model | Model Size (GB) | wer (%) |
|---|:---:|:---:|
| FP32 |1.42|3.45|
| INT4 |0.53|3.57|
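
To put the two rows of the metrics table side by side, the short sketch below computes the size reduction and the WER change from the reported numbers. It assumes the wer column is in percent, as the `wer_result * 100` in the evaluation script suggests; the variable names are illustrative only.

```python
# Values copied from the metrics table above
fp32_size_gb, int4_size_gb = 1.42, 0.53
fp32_wer, int4_wer = 3.45, 3.57  # assumed to be percentages

# How much smaller the INT4 model is than the FP32 model
compression_factor = fp32_size_gb / int4_size_gb
size_reduction_pct = (1 - int4_size_gb / fp32_size_gb) * 100

# Accuracy cost of quantization, in absolute points and relative to FP32
wer_abs_increase = int4_wer - fp32_wer
wer_rel_increase_pct = wer_abs_increase / fp32_wer * 100

print(f"Compression factor: {compression_factor:.2f}x")
print(f"Size reduction: {size_reduction_pct:.1f}%")
print(f"WER increase: {wer_abs_increase:.2f} points ({wer_rel_increase_pct:.1f}% relative)")
```

By these figures the INT4 model is roughly 2.7 times smaller than the FP32 model, at a cost of 0.12 points of wer.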