---
library_name: transformers
base_model: openai/whisper-large
language:
- sv
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- KBLab/rixvox-v2
---

## KB-Whisper Large

The National Library of Sweden releases a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across [FLEURS](https://huggingface.co/datasets/google/fleurs), [CommonVoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1) and [NST](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/), our best performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's `whisper-large-v3`. The smaller model sizes also improve substantially on Swedish speech: `kb-whisper-small` outperforms `openai/whisper-large-v3`, a model six times its size.

| Model size | Developer | FLEURS | CommonVoice | NST |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny) | **KBLab** | **13.2** | **12.9** | **11.2** |
| | OpenAI | 59.2 | 67.8 | 85.2 |
| [base](https://huggingface.co/KBLab/kb-whisper-base) | **KBLab** | **9.1** | **8.7** | **7.8** |
| | OpenAI | 39.6 | 52.1 | 53.4 |
| [small](https://huggingface.co/KBLab/kb-whisper-small) | **KBLab** | **7.3** | **6.4** | **6.6** |
| | OpenAI | 20.6 | 26.4 | 26.4 |
| [medium](https://huggingface.co/KBLab/kb-whisper-medium) | **KBLab** | **6.6** | **5.4** | **5.8** |
| | OpenAI | 12.1 | 15.8 | 17.1 |
| [large-v3](https://huggingface.co/KBLab/kb-whisper-large) | **KBLab** | **5.4** | **4.1** | **5.2** |
| | OpenAI | 7.8 | 9.5 | 11.3 |

Table: **Word Error Rate (WER)** in percent (lower is better) for KBLab's Whisper models and the corresponding OpenAI versions.
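
The 47% figure can be reproduced from the large-v3 rows of the table. The sketch below assumes the average is an unweighted mean of the per-dataset relative WER reductions; the text does not say how the average was taken, so treat this as one plausible reading. The variable and function names are illustrative.

```python
# WER values (percent) copied from the large-v3 rows of the table above.
kblab_wer = {"FLEURS": 5.4, "CommonVoice": 4.1, "NST": 5.2}
openai_wer = {"FLEURS": 7.8, "CommonVoice": 9.5, "NST": 11.3}


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved model."""
    return (baseline - improved) / baseline


# Per-dataset relative reduction, then an unweighted mean (an assumption).
reductions = {
    name: relative_reduction(openai_wer[name], kblab_wer[name])
    for name in kblab_wer
}
average_reduction = sum(reductions.values()) / len(reductions)

for name, value in reductions.items():
    print(f"{name}: {100 * value:.1f}% lower WER")
print(f"Average: {100 * average_reduction:.1f}%")
```

Under this reading the three datasets contribute quite unevenly: the reduction on FLEURS is noticeably smaller than on CommonVoice and NST.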

### Usage

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available, otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-large"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

generate_kwargs = {"task": "transcribe", "language": "sv"}
# Add return_timestamps=True for output with timestamps
res = pipe("audio.mp3", chunk_length_s=30, generate_kwargs=generate_kwargs)
print(res["text"])
```
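
When `return_timestamps=True` is passed, the pipeline result carries a `chunks` list next to `text`, where each chunk has a `timestamp` tuple of start and end seconds. The sketch below shows one way to turn that into readable lines. `format_timestamp` and `format_chunks` are hypothetical helpers, and the `example` dictionary is a hand-written stand-in for real pipeline output, not a transcription produced by the model.

```python
def format_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm; a missing value (None) becomes '?'."""
    if seconds is None:
        return "?"
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_chunks(result):
    """Return one '[start -> end] text' line per timestamped chunk."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"[{format_timestamp(start)} -> {format_timestamp(end)}] {text}")
    return lines


# Hand-written stand-in for the output of
# pipe("audio.mp3", chunk_length_s=30, return_timestamps=True, ...).
example = {
    "text": " Hej och välkommen. Tack för idag.",
    "chunks": [
        {"timestamp": (0.0, 2.5), "text": " Hej och välkommen."},
        {"timestamp": (2.5, None), "text": " Tack för idag."},
    ],
}

for line in format_chunks(example):
    print(line)
```

The `None` handling is defensive: the end time of a final chunk may be missing, and the helper prints `?` rather than failing.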

### Training data

Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. Training ran in two stages, each applying different quality filters with different thresholds.

Stage 1 employed low threshold values (0.15 to 0.30 BLEU), whereas Stage 2 used stricter thresholds (`BLEU >= 0.7`, weighted ROUGE-N `>= 0.7`, CER of first and last 10 characters `<= 0.2`).
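
As a rough illustration of the Stage 2 thresholds, the sketch below checks a single sample against all three criteria. It is not the actual filtering code. It assumes that the BLEU and ROUGE-N scores are precomputed on a 0 to 1 scale, that `reference` and `hypothesis` are the two texts being compared (for example a provided transcript and a machine transcription), and that all criteria must hold at once. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (character level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def passes_stage2(bleu, rouge_n, reference, hypothesis, n_chars=10):
    """Apply the Stage 2 thresholds quoted above to one sample."""
    head_ok = cer(reference[:n_chars], hypothesis[:n_chars]) <= 0.2
    tail_ok = cer(reference[-n_chars:], hypothesis[-n_chars:]) <= 0.2
    return bleu >= 0.7 and rouge_n >= 0.7 and head_ok and tail_ok


same = "hej och välkommen till riksdagen"
print(passes_stage2(0.9, 0.8, same, same))  # all thresholds met
print(passes_stage2(0.5, 0.8, same, same))  # BLEU below 0.7
```

Checking the first and last characters separately presumably catches transcripts whose boundaries are misaligned with the audio even when the overall overlap scores look good. Stage 1, with its lower BLEU thresholds, is not modeled here.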

| Dataset | Stage 1: continued pretraining (h) | Stage 2: finetuning (h) |
|-------------|--------------------------|--------------|
| Subtitles | 34,261 | 3,110 |
| Riksdag | 21,949 | 5,119 |
| ISOF | 54 | 54 |
| NST | 250 | 250 |
| **Total** | **56,514** | **8,533** |

The default when loading our models through Hugging Face is **Stage 2**. We have also uploaded the checkpoints from our continued pretraining (Stage 1) and tagged them. You can load these other checkpoints by specifying the `revision`, for example [`pretrained-checkpoint`](https://huggingface.co/KBLab/kb-whisper-large/tree/pretrained-checkpoint). The default Stage 2 model is tagged `standard`.
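
A minimal sketch of selecting a checkpoint by tag, using the `revision` argument of `from_pretrained`. The two tag names come from the paragraph above. The `REVISIONS` mapping, its `stage1`/`stage2` keys and the `load_kb_whisper` helper are illustrative names, not part of the released models.

```python
# Tag names taken from the text above; the dictionary keys are illustrative.
REVISIONS = {
    "stage1": "pretrained-checkpoint",  # continued pretraining checkpoint
    "stage2": "standard",               # default finetuned model
}


def load_kb_whisper(stage="stage2", model_id="KBLab/kb-whisper-large"):
    """Load the checkpoint tagged for the given training stage."""
    # Imported lazily so the helper can be defined without loading transformers.
    from transformers import AutoModelForSpeechSeq2Seq

    return AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, revision=REVISIONS[stage]
    )


# Example (downloads the weights on first use):
# model = load_kb_whisper("stage1")
```

Because Stage 2 is the default, omitting `revision` altogether should load the same model as `revision="standard"`.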

### Evaluation

#### WER

| Model size | Developer | FLEURS | CommonVoice | NST |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny) | **KBLab** | **13.2** | **12.9** | **11.2** |
| | OpenAI | 59.2 | 67.8 | 85.2 |
| [base](https://huggingface.co/KBLab/kb-whisper-base) | **KBLab** | **9.1** | **8.7** | **7.8** |
| | OpenAI | 39.6 | 52.1 | 53.4 |
| [small](https://huggingface.co/KBLab/kb-whisper-small) | **KBLab** | **7.3** | **6.4** | **6.6** |
| | OpenAI | 20.6 | 26.4 | 26.4 |
| [medium](https://huggingface.co/KBLab/kb-whisper-medium) | **KBLab** | **6.6** | **5.4** | **5.8** |
| | OpenAI | 12.1 | 15.8 | 17.1 |
| [large-v3](https://huggingface.co/KBLab/kb-whisper-large) | **KBLab** | **5.4** | **4.1** | **5.2** |
| | OpenAI | 7.8 | 9.5 | 11.3 |

#### BLEU Score

| Model size | Developer | FLEURS | CommonVoice | NST |
|------------|---------|--------|-------------|------|
| tiny | KBLab | **76.6** | **73.7** | **74.3** |
| | OpenAI | 26.9 | 21.1 | 24.0 |
| base | KBLab | **83.2** | **79.9** | **78.3** |
| | OpenAI | 41.1 | 32.5 | 36.9 |
| small | KBLab | **86.6** | **83.5** | **79.6** |
| | OpenAI | 64.0 | 56.5 | 58.2 |
| medium | KBLab | **87.6** | **85.0** | **80.2** |
| | OpenAI | 77.1 | 70.1 | 68.9 |
| large-v3 | KBLab | **89.8** | **87.2** | **81.1** |
| | OpenAI | 84.9 | 79.1 | 75.1 |