---
language:
- kk
tags:
- audio
- automatic-speech-recognition
- kazakh-asr
widget:
- src: https://drive.google.com/file/d/1udN8ybS7Ih3ESuoYZlaei4RcIPVbJlAf/view?usp=sharing
  example_title: sample
model-index:
- name: whisper-base.kk
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Kazakh Speech Corpus 2 (KSC2)
      type: ksc2
      split: test
      args:
        language: kk
    metrics:
    - name: Test WER
      type: wer
      value: 15.36
pipeline_tag: automatic-speech-recognition
license: apache-2.0
---

# Whisper

Whisper-base fine-tuned for automatic speech recognition (ASR) in Kazakh, a low-resource language. The model was fine-tuned on the [Kazakh Speech Corpus 2 (KSC2)](https://issai.nu.edu.kz/2022/12/13/ksc2-an-industrial-scale-open-source-kazakh-speech-corpus/), which provides over 1,000 hours of labelled speech, and achieves a word error rate (WER) of 15.36% on the test set.
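
For reference, the WER metric can be reproduced with the Hugging Face [`evaluate`](https://huggingface.co/docs/evaluate) library. The sketch below is illustrative only: the `predictions` and `references` lists are placeholder transcripts, not actual KSC2 data.

```python
import evaluate

# load the word error rate metric
wer_metric = evaluate.load("wer")

# placeholder transcripts: in practice these would be the model outputs
# and the reference transcriptions for the KSC2 test split
predictions = ["бүгін ауа райы жақсы"]
references = ["бүгін ауа райы жақсы болады"]

# compute() returns a fraction; multiply by 100 to get a percentage
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}%")
```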

# Usage

This checkpoint is a *Kazakh-only* model, meaning it can be used *only* for Kazakh speech recognition.

## Transcription

```python
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> import librosa

>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("akuzdeuov/whisper-base.kk")
>>> model = WhisperForConditionalGeneration.from_pretrained("akuzdeuov/whisper-base.kk")

>>> # load your audio, resampled to the 16 kHz expected by Whisper
>>> audio, sampling_rate = librosa.load("path_to_audio", sr=16000)
>>> input_features = processor(audio, sampling_rate=sampling_rate, return_tensors="pt").input_features

>>> # generate token ids
>>> predicted_ids = model.generate(input_features)

>>> # decode token ids to text, keeping the special context tokens
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)

>>> # decode token ids to text, without the special context tokens
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
```

The context tokens can be removed from the start of the transcription by setting `skip_special_tokens=True`.
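
Since this checkpoint is fine-tuned for Kazakh, generation should default to Kazakh transcription. If you want to pin the language and task explicitly anyway, the processor's `get_decoder_prompt_ids` helper can be used, as sketched below (an optional step; argument handling may vary across `transformers` versions):

```python
>>> # optional: force Kazakh transcription during generation
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="kazakh", task="transcribe")
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
```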

## Long-Form Transcription

The Whisper model is intrinsically designed to work on audio samples of up to 30 s in duration. However, by using a chunking algorithm, it can transcribe audio samples of arbitrary length. This is possible with the Transformers [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference.

```python
>>> import torch
>>> from transformers import pipeline

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> pipe = pipeline(
...     "automatic-speech-recognition",
...     model="akuzdeuov/whisper-base.kk",
...     chunk_length_s=30,
...     device=device,
... )

>>> prediction = pipe("path_to_audio", batch_size=8)["text"]
```
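
The pipeline can also return per-chunk timestamps, which is useful for subtitling; a minimal sketch:

```python
>>> # return timestamps alongside the transcription
>>> prediction = pipe("path_to_audio", batch_size=8, return_timestamps=True)
>>> prediction["chunks"]  # list of {"timestamp": (start, end), "text": ...} segments
```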

## References

1. [Whisper, OpenAI](https://huggingface.co/openai/whisper-base.en)