---
language:
- kk
tags:
- audio
- automatic-speech-recognition
- kazakh-asr
widget:
- src: https://drive.google.com/file/d/1udN8ybS7Ih3ESuoYZlaei4RcIPVbJlAf/view?usp=sharing
  example_title: sample
model-index:
- name: whisper-base.kk
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kazakh Speech Corpus 2 (KSC2)
      type: ksc2
      split: test
      args:
        language: kk
    metrics:
    - name: Test WER
      type: wer
      value: 15.36
pipeline_tag: automatic-speech-recognition
license: apache-2.0
---

# Whisper

Whisper-base fine-tuned for automatic speech recognition (ASR) in Kazakh, a low-resource language. The model was fine-tuned on the [Kazakh Speech Corpus 2 (KSC2)](https://issai.nu.edu.kz/2022/12/13/ksc2-an-industrial-scale-open-source-kazakh-speech-corpus/), which provides over 1,000 hours of labelled speech, and achieves a word error rate (WER) of 15.36% on the test set.
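
For reference, the WER metric can be reproduced with the Hugging Face [`evaluate`](https://huggingface.co/docs/evaluate) library. The sketch below is illustrative only: the `predictions` and `references` lists are placeholder transcripts, not actual KSC2 data.

```python
import evaluate

# load the word error rate metric
wer_metric = evaluate.load("wer")

# placeholder transcripts: in practice these would be the model outputs
# and the reference transcriptions for the KSC2 test split
predictions = ["бүгін ауа райы жақсы"]
references = ["бүгін ауа райы жақсы болады"]

# compute() returns a fraction; multiply by 100 to get a percentage
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}%")
```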

# Usage

This checkpoint is a *Kazakh-only* model, meaning it can be used *only* for Kazakh speech recognition.

## Transcription

```python
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> import librosa

>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("akuzdeuov/whisper-base.kk")
>>> model = WhisperForConditionalGeneration.from_pretrained("akuzdeuov/whisper-base.kk")

>>> # load your audio, resampled to the 16 kHz expected by Whisper
>>> audio, sampling_rate = librosa.load("path_to_audio", sr=16000)
>>> input_features = processor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features

>>> # generate token ids
>>> predicted_ids = model.generate(input_features)

>>> # decode token ids to text, keeping the special context tokens
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)

>>> # decode token ids to text, without the special context tokens
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
```

The context tokens can be removed from the start of the transcription by setting `skip_special_tokens=True`.
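
Since this checkpoint is fine-tuned for Kazakh, generation should default to Kazakh transcription. If you want to pin the language and task explicitly anyway, the processor's `get_decoder_prompt_ids` helper can be used, as sketched below (an optional step; argument handling may vary across `transformers` versions):

```python
>>> # optional: force Kazakh transcription during generation
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="kazakh", task="transcribe")
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
```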

## Long-Form Transcription

The Whisper model is intrinsically designed to work on audio samples of up to 30 s in duration. However, by using a chunking algorithm, it can transcribe audio samples of arbitrary length. This is possible with the Transformers [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference.

```python
>>> import torch
>>> from transformers import pipeline

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> pipe = pipeline(
...     "automatic-speech-recognition",
...     model="akuzdeuov/whisper-base.kk",
...     chunk_length_s=30,
...     device=device,
... )

>>> prediction = pipe("path_to_audio", batch_size=8)["text"]
```
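
The pipeline can also return per-chunk timestamps, which is useful for subtitling; a minimal sketch:

```python
>>> # return timestamps alongside the transcription
>>> prediction = pipe("path_to_audio", batch_size=8, return_timestamps=True)
>>> prediction["chunks"]  # list of {"timestamp": (start, end), "text": ...} segments
```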

## References

1. [Whisper, OpenAI](https://huggingface.co/openai/whisper-base.en)