	---
	library_name: transformers
	base_model: openai/whisper-large
	language:
	- sv
	pipeline_tag: automatic-speech-recognition
	license: apache-2.0
	datasets:
	- KBLab/rixvox-v2
	---
	## KB-Whisper Large

	The National Library of Sweden releases a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across [FLEURS](https://huggingface.co/datasets/google/fleurs), [CommonVoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1) and [NST](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/), our best performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's `whisper-large-v3`. The performance of smaller Whisper model sizes on Swedish speech has also substantially improved, with `kb-whisper-small` outperforming `openai/whisper-large-v3` (a model six times its size).

	| Model size | Provider | FLEURS | CommonVoice | NST |
	|------------|----------|--------|-------------|-----|
	| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny) | KBLab | 13.2 | 12.9 | 11.2 |
	| | OpenAI | 59.2 | 67.8 | 85.2 |
	| [base](https://huggingface.co/KBLab/kb-whisper-base) | KBLab | 9.1 | 8.7 | 7.8 |
	| | OpenAI | 39.6 | 52.1 | 53.4 |
	| [small](https://huggingface.co/KBLab/kb-whisper-small) | KBLab | 7.3 | 6.4 | 6.6 |
	| | OpenAI | 20.6 | 26.4 | 26.4 |
	| [medium](https://huggingface.co/KBLab/kb-whisper-medium) | KBLab | 6.6 | 5.4 | 5.8 |
	| | OpenAI | 12.1 | 15.8 | 17.1 |
	| [large-v3](https://huggingface.co/KBLab/kb-whisper-large) | KBLab | 5.4 | 4.1 | 5.2 |
	| | OpenAI | 7.8 | 9.5 | 11.3 |

	Table: Word Error Rate (WER) comparison between KBLab's Whisper models and the corresponding OpenAI versions.
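
The 47% figure can be reproduced from the `large-v3` rows of the table. The following is a minimal sketch that assumes the average is the unweighted mean of the per-dataset relative WER reductions; the text does not state the averaging method, so treat that as an assumption. The helper name `relative_reduction` is introduced here for illustration only.

```python
# WER values for the large-v3 rows of the table above
kblab_wer = {"FLEURS": 5.4, "CommonVoice": 4.1, "NST": 5.2}
openai_wer = {"FLEURS": 7.8, "CommonVoice": 9.5, "NST": 11.3}


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error rate relative to `baseline`."""
    return (baseline - improved) / baseline


# Per-dataset relative reduction, then an unweighted mean (assumed averaging method)
reductions = {
    name: relative_reduction(openai_wer[name], kblab_wer[name]) for name in kblab_wer
}
average_reduction = sum(reductions.values()) / len(reductions)

for name, value in reductions.items():
    print(f"{name}: {value:.1%}")
print(f"Average: {average_reduction:.1%}")
```

Under that assumption the per-dataset reductions are roughly 30.8% (FLEURS), 56.8% (CommonVoice) and 54.0% (NST), which average to about 47.2%.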

	### Usage

	```python
	import torch
	from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

	# Use the GPU with half precision when available, otherwise fall back to CPU
	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
	model_id = "KBLab/kb-whisper-large"

	model = AutoModelForSpeechSeq2Seq.from_pretrained(
	    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
	)
	model.to(device)
	processor = AutoProcessor.from_pretrained(model_id)

	pipe = pipeline(
	    "automatic-speech-recognition",
	    model=model,
	    tokenizer=processor.tokenizer,
	    feature_extractor=processor.feature_extractor,
	    torch_dtype=torch_dtype,
	    device=device,
	)

	# Force Swedish transcription instead of relying on language detection
	generate_kwargs = {"task": "transcribe", "language": "sv"}
	# Add return_timestamps=True for output with timestamps
	res = pipe("audio.mp3", chunk_length_s=30, generate_kwargs=generate_kwargs)
	```
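
When `return_timestamps=True` is passed, the Transformers speech recognition pipeline returns a `chunks` list next to `text`, where each entry has a `timestamp` tuple and a `text` field. The sketch below formats such a result for reading. `format_chunks` is a hypothetical helper, and `example_result` is made-up illustrative data, not real model output.

```python
def format_chunks(result: dict) -> list[str]:
    """Turn a pipeline result with timestamps into '[start - end] text' lines."""

    def fmt(seconds):
        # The end time of the final chunk can be None when audio is cut mid-segment
        return "?" if seconds is None else f"{seconds:.2f}"

    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{fmt(start)} - {fmt(end)}] {chunk['text'].strip()}")
    return lines


# Illustrative data in the shape returned by pipe(..., return_timestamps=True)
example_result = {
    "text": " Hej och välkommen. Vi börjar nu.",
    "chunks": [
        {"timestamp": (0.0, 1.8), "text": " Hej och välkommen."},
        {"timestamp": (1.8, None), "text": " Vi börjar nu."},
    ],
}

for line in format_chunks(example_result):
    print(line)
```

With a real model, the same helper would be applied to `pipe("audio.mp3", chunk_length_s=30, return_timestamps=True, generate_kwargs=generate_kwargs)`.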

	### Training data

	Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. The models were trained in two stages, each applying different quality filters and different thresholds for those filters.

	Stage 1 employed low threshold values (0.15 to 0.30 BLEU), whereas Stage 2 used stricter thresholds (`BLEU >= 0.7`, weighted ROUGE-N `>= 0.7`, CER of first and last 10 characters `<= 0.2`).
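
The thresholds above can be read as a simple predicate over per-segment quality scores. The following is a rough sketch, not the actual filtering code: the `SegmentScores` fields and function names are hypothetical, the scores are assumed to be precomputed on a 0 to 1 scale, and it assumes that all Stage 2 conditions must hold at once and that the CER limit applies to the first and the last 10 characters separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentScores:
    """Precomputed quality scores for one audio/text segment (hypothetical schema)."""

    bleu: float      # BLEU between transcript and reference text
    rouge_n: float   # weighted ROUGE-N
    cer_head: float  # character error rate over the first 10 characters
    cer_tail: float  # character error rate over the last 10 characters


def passes_stage1(scores: SegmentScores, bleu_threshold: float = 0.15) -> bool:
    """Lenient filter; the text gives Stage 1 BLEU thresholds between 0.15 and 0.30."""
    return scores.bleu >= bleu_threshold


def passes_stage2(scores: SegmentScores) -> bool:
    """Strict filter using the Stage 2 thresholds quoted above."""
    return (
        scores.bleu >= 0.7
        and scores.rouge_n >= 0.7
        and scores.cer_head <= 0.2
        and scores.cer_tail <= 0.2
    )


# A segment that is good enough for Stage 1 but too noisy for Stage 2
noisy = SegmentScores(bleu=0.45, rouge_n=0.60, cer_head=0.10, cer_tail=0.30)
print(passes_stage1(noisy), passes_stage2(noisy))
```

This also illustrates why the Stage 2 hour counts in the table below are much smaller than the Stage 1 counts: fewer segments survive the stricter filter.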

	| Dataset | Stage 1: continued pretraining (hours) | Stage 2: finetuning (hours) |
	|-----------|----------------------------------------|-----------------------------|
	| Subtitles | 34,261 | 3,110 |
	| Riksdag | 21,949 | 5,119 |
	| ISOF | 54 | 54 |
	| NST | 250 | 250 |
	| Total | 56,514 | 8,533 |

	The default when loading our models through Hugging Face is the Stage 2 model. We have also uploaded the checkpoints from our continued pretraining and tagged them. You can load these other checkpoints by specifying the `revision`, for example [`pretrained-checkpoint`](https://huggingface.co/KBLab/kb-whisper-large/tree/pretrained-checkpoint). The tag for the default Stage 2 model is `standard`.
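
A minimal sketch of selecting a checkpoint by tag through the `revision` argument of `from_pretrained`. The tag names come from the paragraph above; the `REVISIONS` mapping, its `stage1`/`stage2` keys and the `load_model` helper are hypothetical names used only for this example.

```python
MODEL_ID = "KBLab/kb-whisper-large"

# Hypothetical short labels mapped to the tags named in this README
REVISIONS = {
    "stage1": "pretrained-checkpoint",  # continued pretraining checkpoint
    "stage2": "standard",               # default finetuned model
}


def load_model(stage: str = "stage2"):
    """Load the checkpoint for the given stage (downloads weights on first use)."""
    revision = REVISIONS[stage]  # raises KeyError for unknown labels
    # Imported lazily so the mapping above can be used without transformers installed
    from transformers import AutoModelForSpeechSeq2Seq

    return AutoModelForSpeechSeq2Seq.from_pretrained(
        MODEL_ID, revision=revision, use_safetensors=True
    )
```

Calling `load_model("stage1")` would fetch the continued pretraining checkpoint, while `load_model()` would fetch the same weights as loading the model without a `revision`.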

	### Evaluation


	#### WER
	| Model size | Provider | FLEURS | CommonVoice | NST |
	|------------|----------|--------|-------------|-----|
	| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny) | KBLab | 13.2 | 12.9 | 11.2 |
	| | OpenAI | 59.2 | 67.8 | 85.2 |
	| [base](https://huggingface.co/KBLab/kb-whisper-base) | KBLab | 9.1 | 8.7 | 7.8 |
	| | OpenAI | 39.6 | 52.1 | 53.4 |
	| [small](https://huggingface.co/KBLab/kb-whisper-small) | KBLab | 7.3 | 6.4 | 6.6 |
	| | OpenAI | 20.6 | 26.4 | 26.4 |
	| [medium](https://huggingface.co/KBLab/kb-whisper-medium) | KBLab | 6.6 | 5.4 | 5.8 |
	| | OpenAI | 12.1 | 15.8 | 17.1 |
	| [large-v3](https://huggingface.co/KBLab/kb-whisper-large) | KBLab | 5.4 | 4.1 | 5.2 |
	| | OpenAI | 7.8 | 9.5 | 11.3 |
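
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a generic implementation for illustration, not the evaluation code behind the table. It splits on whitespace only and applies no text normalization, whereas this README does not specify the normalization used in the evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words
print(word_error_rate("det är ett test", "det är en test"))
```

The example prints `0.25`, which corresponds to 25 on the percentage scale used in the table above.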


	#### BLEU Score
	| Model size | Provider | FLEURS | CommonVoice | NST |
	|------------|----------|--------|-------------|-----|
	| tiny | KBLab | 76.6 | 73.7 | 74.3 |
	| | OpenAI | 26.9 | 21.1 | 24.0 |
	| base | KBLab | 83.2 | 79.9 | 78.3 |
	| | OpenAI | 41.1 | 32.5 | 36.9 |
	| small | KBLab | 86.6 | 83.5 | 79.6 |
	| | OpenAI | 64.0 | 56.5 | 58.2 |
	| medium | KBLab | 87.6 | 85.0 | 80.2 |
	| | OpenAI | 77.1 | 70.1 | 68.9 |
	| large-v3 | KBLab | 89.8 | 87.2 | 81.1 |
	| | OpenAI | 84.9 | 79.1 | 75.1 |
