---
language:
- kk
tags:
- audio
- automatic-speech-recognition
- kazakh-asr
widget:
- src: https://drive.google.com/file/d/1udN8ybS7Ih3ESuoYZlaei4RcIPVbJlAf/view?usp=sharing
  example_title: sample
model-index:
- name: whisper-base.kk
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kazakh Speech Corpus 2 (KSC2)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: kk
    metrics:
    - name: Test WER
      type: wer
      value: 15.36
pipeline_tag: automatic-speech-recognition
license: apache-2.0
---

# Whisper

Whisper-base for automatic speech recognition (ASR) of the low-resource Kazakh language. The model was fine-tuned on the [Kazakh Speech Corpus 2](https://issai.nu.edu.kz/2022/12/13/ksc2-an-industrial-scale-open-source-kazakh-speech-corpus/), which provides over 1,000 hours of labelled data, and achieved a word error rate (WER) of 15.36% on the test set.

# Usage

This checkpoint is a *Kazakh-only* model, meaning it can be used *only* for Kazakh speech recognition.

## Transcription

```python
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> import librosa

>>> # load the model and processor
>>> processor = WhisperProcessor.from_pretrained("akuzdeuov/whisper-base.kk")
>>> model = WhisperForConditionalGeneration.from_pretrained("akuzdeuov/whisper-base.kk")

>>> # load the audio and resample it to 16 kHz
>>> audio, sampling_rate = librosa.load("path_to_audio", sr=16000)
>>> input_features = processor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features

>>> # generate token ids
>>> predicted_ids = model.generate(input_features)

>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
```

The context tokens can be removed from the start of the transcription by setting `skip_special_tokens=True`.

## Long-Form Transcription

The Whisper model is intrinsically designed to work on audio samples of up to 30 s in duration. However, by using a chunking algorithm, it can transcribe audio samples of arbitrary length. This is possible through the Transformers [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can also be run with batched inference.

```python
>>> import torch
>>> from transformers import pipeline

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> pipe = pipeline(
>>>   "automatic-speech-recognition",
>>>   model="akuzdeuov/whisper-base.kk",
>>>   chunk_length_s=30,
>>>   device=device,
>>> )

>>> prediction = pipe("path_to_audio", batch_size=8)["text"]
```

## References

1. [Whisper, OpenAI.](https://huggingface.co/openai/whisper-base.en)
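
## Evaluation

The 15.36% WER reported above was measured on the KSC2 test split. To estimate WER on your own labelled Kazakh audio, a minimal sketch using the Hugging Face `evaluate` library is shown below; the file paths and reference transcripts are placeholders, not part of KSC2.

```python
>>> import evaluate
>>> from transformers import pipeline

>>> # hypothetical test files and their ground-truth transcripts
>>> audio_paths = ["sample1.wav", "sample2.wav"]
>>> references = ["reference transcript 1", "reference transcript 2"]

>>> # transcribe with the fine-tuned checkpoint
>>> pipe = pipeline("automatic-speech-recognition", model="akuzdeuov/whisper-base.kk", chunk_length_s=30)
>>> predictions = [pipe(path)["text"] for path in audio_paths]

>>> # word error rate in percent
>>> wer = evaluate.load("wer")
>>> print(100 * wer.compute(references=references, predictions=predictions))
```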