---
library_name: transformers
base_model: openai/whisper-large-v3
language:
- sv
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- KBLab/rixvox-v2
tags:
- ctranslate2
---
## KB-Whisper Large
The National Library of Sweden releases a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across [FLEURS](https://huggingface.co/datasets/google/fleurs), [CommonVoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1) and [NST](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/), our best performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's `whisper-large-v3`. The performance of smaller Whisper model sizes on Swedish speech has also substantially improved, with `kb-whisper-small` outperforming `openai/whisper-large-v3` (a model six times its size).
| Model size | | FLEURS | CommonVoice | NST |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny) | **KBLab** | **13.2** | **12.9** | **11.2** |
| | OpenAI | 59.2 | 67.8 | 85.2 |
| [base](https://huggingface.co/KBLab/kb-whisper-base) | **KBLab** | **9.1** | **8.7** | **7.8** |
| | OpenAI | 39.6 | 52.1 | 53.4 |
| [small](https://huggingface.co/KBLab/kb-whisper-small) | **KBLab** | **7.3** | **6.4** | **6.6** |
| | OpenAI | 20.6 | 26.4 | 26.4 |
| [medium](https://huggingface.co/KBLab/kb-whisper-medium) | **KBLab** | **6.6** | **5.4** | **5.8** |
| | OpenAI | 12.1 | 15.8 | 17.1 |
| [large-v3](https://huggingface.co/KBLab/kb-whisper-large) | **KBLab** | **5.4** | **4.1** | **5.2** |
| | OpenAI | 7.8 | 9.5 | 11.3 |
Table: **Word Error Rate (WER)** comparison between KBLab's Whisper models and the corresponding OpenAI versions.
### Usage
We provide checkpoints in different formats: `Hugging Face`, `whisper.cpp` (GGML), `onnx`, and `ctranslate2` (used in `faster-whisper` and `WhisperX`).
#### Hugging Face
Inference example for using `KB-Whisper` with Hugging Face:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-large"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)
generate_kwargs = {"task": "transcribe", "language": "sv"}
# Add return_timestamps=True to the pipe() call for output with timestamps
res = pipe(
    "audio.mp3",
    chunk_length_s=30,
    generate_kwargs=generate_kwargs,
)
```
#### Faster-whisper
[Faster-whisper](https://github.com/SYSTRAN/faster-whisper) provides fast and efficient inference via a reimplementation of Whisper using `ctranslate2`.
```python
#### faster-whisper model ####
from faster_whisper import WhisperModel
model_id = "KBLab/kb-whisper-large"
model = WhisperModel(
    model_id,
    device="cuda",
    compute_type="float16",
    download_root="cache",  # cache directory
)

# Transcribe audio.wav (convert to 16 kHz mono WAV first via ffmpeg)
# condition_on_previous_text=False can reduce hallucinations if we don't use prompts
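# e.g.: ffmpeg -i input.mp3 -ar 16000 -ac 1 audio.wav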
segments, info = model.transcribe("audio.wav", condition_on_previous_text=False)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
#### WhisperX
[WhisperX](https://github.com/m-bain/whisperX) provides a convenient method of getting accurate word-level timestamps. The library force-aligns the text output of Whisper with the accurate timestamps of wav2vec2. Below is an example of how to use `KB-Whisper` together with [KBLab/wav2vec2-large-voxrex-swedish](https://huggingface.co/KBLab/wav2vec2-large-voxrex-swedish).
```python
import whisperx
device = "cuda"
audio_file = "audio.wav"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)
# 1. Transcribe with original whisper (batched)
model = whisperx.load_model(
    "KBLab/kb-whisper-large", device, compute_type=compute_type, download_root="cache"
)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment
# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model
# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device,
    model_name="KBLab/wav2vec2-large-voxrex-swedish",
    model_dir="cache",  # cache directory
)

result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)
print(result["segments"]) # word level timestamps after alignment
```
#### Whisper.cpp / GGML
We provide GGML checkpoints used in apps such as `whisper.cpp` and `MacWhisper`. To use our model with `whisper.cpp`, first clone the repository and build the library:
```
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release
```
To use the model, download one of the GGML checkpoints we have uploaded. You can either click the download buttons [here](https://huggingface.co/KBLab/kb-whisper-large/tree/main) or download using `wget`:
```
wget https://huggingface.co/KBLab/kb-whisper-large/resolve/main/ggml-model-q5_0.bin # Quantized version
# wget https://huggingface.co/KBLab/kb-whisper-large/resolve/main/ggml-model.bin # Non-quantized version
```
Run inference by specifying the model path after the argument `-m`, along with the path to the audio file as the last positional argument.
```
./build/bin/whisper-cli -m ggml-model-q5_0.bin ../audio.wav
```
#### onnx (optimum) and transformers.js usage
You can use the `onnx` checkpoints via Hugging Face's `optimum` library in the following manner:
```python
import soundfile as sf

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor
model_id = "KBLab/kb-whisper-large"
processor = AutoProcessor.from_pretrained(model_id, cache_dir="cache")
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    cache_dir="cache",
    subfolder="onnx",
)
# sf.read returns (data, sample_rate); the audio should be 16 kHz mono
audio, _ = sf.read("audio.wav")
inputs = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
gen_tokens = model.generate(**inputs, max_length=300)
processor.decode(gen_tokens[0], skip_special_tokens=True)
```
An example of an app that runs inference locally in the browser with `transformers.js` and `KB-Whisper` can be found at [https://whisper.mesu.re/](https://whisper.mesu.re/) (created by Pierre Mesure). A template for setting up such an app with JavaScript can be found at [https://github.com/xenova/whisper-web](https://github.com/xenova/whisper-web).
### Training data
Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. The models were trained in two stages, each applying different quality filters and filter thresholds.

Stage 1 employed low threshold values (BLEU between 0 and 0.30, depending on the dataset), whereas Stage 2 used stricter thresholds (`BLEU >= 0.7`, weighted ROUGE-N `>= 0.7`, CER of the first and last 10 characters `<= 0.2`).
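As an illustration, here is a minimal sketch of what a Stage 2 segment filter along these lines could look like, assuming `sacrebleu` and `jiwer` as the metric implementations. This is not the exact training code, and the weighted ROUGE-N criterion is omitted since its weighting scheme is not specified here:

```python
import jiwer
import sacrebleu


def passes_stage2_filter(reference: str, hypothesis: str) -> bool:
    """Illustrative Stage 2 quality filter (not the exact training code)."""
    # Sentence-level BLEU; sacrebleu reports 0-100, the thresholds above are 0-1
    bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score / 100
    if bleu < 0.7:
        return False
    # CER of the first and last 10 characters, catching clipped or
    # misaligned segment boundaries
    head_cer = jiwer.cer(reference[:10], hypothesis[:10])
    tail_cer = jiwer.cer(reference[-10:], hypothesis[-10:])
    return head_cer <= 0.2 and tail_cer <= 0.2
```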
| Dataset | Continued pretraining (h) -- Stage 1 | Finetuning (h) -- Stage 2 |
|-------------|--------------------------|--------------|
| Subtitles | 34,261 | 3,110 |
| Riksdag | 21,949 | 5,119 |
| ISOF | 54 | 54 |
| NST | 250 | 250 |
| **Total** | **56,514** | **8,533** |
The default when loading our models through Hugging Face is **Stage 2**. We have, however, also uploaded and tagged the continued-pretraining (Stage 1) checkpoints. You can load these other checkpoints by specifying a `revision` in `.from_pretrained()`. The Stage 1 checkpoint is tagged [`pretrained-checkpoint`](https://huggingface.co/KBLab/kb-whisper-large/tree/pretrained-checkpoint), and the default Stage 2 model is tagged `standard`. We also supply a different Stage 2 checkpoint, with a more condensed style of transcribing, under the name `subtitle`.
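For example, to load the subtitle-style Stage 2 checkpoint (the `revision` values below are the tags listed above):

```python
from transformers import AutoModelForSpeechSeq2Seq

# Pick a checkpoint via its revision tag:
#   "standard" (default Stage 2), "subtitle", or "pretrained-checkpoint" (Stage 1)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "KBLab/kb-whisper-large",
    revision="subtitle",
    cache_dir="cache",
)
```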
### Evaluation
#### WER
| Model size | | FLEURS | CommonVoice | NST |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny) | **KBLab** | **13.2** | **12.9** | **11.2** |
| | OpenAI | 59.2 | 67.8 | 85.2 |
| [base](https://huggingface.co/KBLab/kb-whisper-base) | **KBLab** | **9.1** | **8.7** | **7.8** |
| | OpenAI | 39.6 | 52.1 | 53.4 |
| [small](https://huggingface.co/KBLab/kb-whisper-small) | **KBLab** | **7.3** | **6.4** | **6.6** |
| | OpenAI | 20.6 | 26.4 | 26.4 |
| [medium](https://huggingface.co/KBLab/kb-whisper-medium) | **KBLab** | **6.6** | **5.4** | **5.8** |
| | OpenAI | 12.1 | 15.8 | 17.1 |
| [large-v3](https://huggingface.co/KBLab/kb-whisper-large) | **KBLab** | **5.4** | **4.1** | **5.2** |
| | OpenAI | 7.8 | 9.5 | 11.3 |
#### BLEU Score
| Model size | | FLEURS | CommonVoice | NST |
|------------|---------|--------|-------------|------|
| tiny | KBLab | **76.6** | **73.7** | **74.3** |
| | OpenAI | 26.9 | 21.1 | 24.0 |
| base | KBLab | **83.2** | **79.9** | **78.3** |
| | OpenAI | 41.1 | 32.5 | 36.9 |
| small | KBLab | **86.6** | **83.5** | **79.6** |
| | OpenAI | 64.0 | 56.5 | 58.2 |
| medium | KBLab | **87.6** | **85.0** | **80.2** |
| | OpenAI | 77.1 | 70.1 | 68.9 |
| large-v3 | KBLab | **89.8** | **87.2** | **81.1** |
| | OpenAI | 84.9 | 79.1 | 75.1 |
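For reference, WER and BLEU scores of the kind reported above can be computed with standard libraries such as `jiwer` and `sacrebleu`. A minimal sketch, not our exact evaluation pipeline:

```python
import jiwer
import sacrebleu

references = ["det här är ett exempel"]  # ground-truth transcripts
hypotheses = ["det har ar ett exempel"]  # model output

wer = jiwer.wer(references, hypotheses) * 100  # percent, as in the tables above
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
print(f"WER: {wer:.1f}, BLEU: {bleu:.1f}")
```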
### Acknowledgements
We acknowledge the EuroHPC Joint Undertaking for awarding this project access to the EuroHPC supercomputer LEONARDO, hosted by CINECA (Italy) and the LEONARDO consortium through an EuroHPC AI and Data-Intensive Applications Access call.
### Citation
Paper reference coming soon.