File size: 3,668 Bytes
5658e63
 
e1f29e0
 
 
 
 
 
 
 
 
 
 
 
 
5658e63
e1f29e0
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
---
license: apache-2.0
datasets:
- librispeech_asr
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- int8
- ONNX
- PostTrainingStatic
- Intel® Neural Compressor
- neural-compressor
library_name: transformers
---
## Model Details: INT8 Whisper base

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.

This int8 ONNX model is generated by [neural-compressor](https://github.com/intel/neural-compressor) and the fp32 model can be exported with below command:
```shell
optimum-cli export onnx --model openai/whisper-base whisper-base-with-past/ --task automatic-speech-recognition-with-past --opset 13
```

| Model Detail | Description |
| ----------- | ----------- | 
| Model Authors - Company | Intel | 
| Date | August 25, 2023 | 
| Version | 1 | 
| Type | Speech Recognition | 
| Paper or Other Resources | - | 
| License | Apache 2.0 |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/whisper-base-int8-static/discussions)|

| Intended Use | Description |
| ----------- | ----------- | 
| Primary intended uses | You can use the raw model for automatic speech recognition inference | 
| Primary intended users | Anyone doing automatic speech recognition inference | 
| Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task.  The model should not be used to intentionally create hostile or alienating environments for people.|


### How to use

Download the model by cloning the repository:
```shell
git clone https://huggingface.co/Intel/whisper-base-int8-static
```

Evaluate the model with below code:
```python
import os
from evaluate import load
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor, AutoConfig
model_name = 'openai/whisper-base'
model_path = 'whisper-base-int8-static'
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import PretrainedConfig
model_config = PretrainedConfig.from_pretrained(model_name)
predictions = []
references = []
sessions = ORTModelForSpeechSeq2Seq.load_model(
            os.path.join(model_path, 'encoder_model.onnx'),
            os.path.join(model_path, 'decoder_model.onnx'),
            os.path.join(model_path, 'decoder_with_past_model.onnx'))
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])
for idx, batch in enumerate(librispeech_test_clean):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    reference = processor.tokenizer._normalize(batch['text'])
    references.append(reference)
    predicted_ids = model.generate(input_features)[0]
    transcription = processor.decode(predicted_ids)
    prediction = processor.tokenizer._normalize(transcription)
    predictions.append(prediction)
wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
accuracy = 1 - wer_result
print("Accuracy: %.5f" % accuracy)
```

## Metrics (Model Performance):
| Model  | Model Size (GB) | wer |
|---|:---:|:---:|
| FP32 |0.95|5.04|
| INT8 |0.32|5.89|