---
language:
- xh
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
datasets:
- Beijuka/xhosa_parakeet_50hr
- wjbmattingly/xhosa_merged_audio
metrics:
- wer
base_model: openai/whisper-small
license: apache-2.0
---
# Whisper-Small Fine-tuned for isiXhosa ASR
## Model Description
This model is a fine-tuned version of OpenAI's Whisper-small for isiXhosa Automatic Speech Recognition (ASR). It was trained on the NCHLT isiXhosa Speech Corpus to improve transcription quality on isiXhosa speech.
## Performance
- Word Error Rate (WER): 29.73%
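For reference, WER is the word-level edit distance (substitutions, insertions, deletions) between the model output and the reference transcript, normalized by the reference length. A minimal sketch using the Hugging Face `evaluate` library; the sentence pair below is illustrative only and not drawn from the evaluation set:

```python
import evaluate

# Load the standard WER metric implementation
wer_metric = evaluate.load("wer")

# Illustrative example pair (not from the actual evaluation data)
references = ["molo unjani namhlanje"]    # ground-truth transcript
predictions = ["molo unjani namhlange"]   # model output with one word error

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2%}")  # 1 substitution over 3 reference words ≈ 33.33%
```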
## Base Model
- Name: openai/whisper-small
- Type: Automatic Speech Recognition (ASR)
- Original language: Multilingual
## Usage
To use this model for inference:
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("TheirStory-Inc/whisper-small-xhosa")
processor = WhisperProcessor.from_pretrained("TheirStory-Inc/whisper-small-xhosa")
# Prepare your audio as a 16 kHz mono waveform (e.g. a NumPy float array)
audio_input = ...  # Load your audio file here
# Extract log-Mel input features from the raw audio
input_features = processor(audio_input, sampling_rate=16000, return_tensors="pt").input_features
# Generate token ids (no gradients needed for inference)
with torch.no_grad():
    predicted_ids = model.generate(input_features)
# Decode the token ids to text (batch_decode returns a list of strings)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
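Alternatively, the high-level `pipeline` API handles audio decoding and resampling for you. A minimal sketch, where the file name `sample.wav` is a placeholder and ffmpeg is needed to decode most audio formats:

```python
from transformers import pipeline

# The ASR pipeline loads and resamples the audio file itself;
# chunk_length_s enables long-form transcription in 30 s windows.
asr = pipeline(
    "automatic-speech-recognition",
    model="TheirStory-Inc/whisper-small-xhosa",
    chunk_length_s=30,
)

result = asr("sample.wav")  # path to your audio file (placeholder name)
print(result["text"])
```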
## Fine-tuning Dataset
- Name: NCHLT isiXhosa Speech Corpus
- Size: Approximately 56 hours of transcribed speech
- Speakers: 209 (106 female, 103 male)
- Content: Prompted speech (3-5 word utterances read from a smartphone screen)
- Source: Audio recorded on smartphones in non-studio environments
- License: Creative Commons Attribution 3.0 Unported License (CC BY 3.0)
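If you want to prepare similar Hub-hosted audio data (such as the datasets listed in the metadata above) for evaluation or further fine-tuning, the audio must be resampled to 16 kHz before feature extraction. A minimal sketch with the `datasets` library; the split name `train` and the column name `audio` are assumptions about the dataset layout, so adjust them to the actual schema:

```python
from datasets import load_dataset, Audio

# Load one of the Hub datasets listed in the model card metadata.
# The split name "train" and column name "audio" are assumptions.
ds = load_dataset("Beijuka/xhosa_parakeet_50hr", split="train")

# Whisper's feature extractor expects 16 kHz input, so cast the audio column.
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

sample = ds[0]["audio"]
print(sample["sampling_rate"], len(sample["array"]))
```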
### Citation
```bibtex
@article{devries2014smartphone,
  author  = {de Vries, N. J. and Davel, M. H. and Badenhorst, J. and Basson, W. D. and de Wet, F. and Barnard, E. and de Waal, A.},
  title   = {A smartphone-based {ASR} data collection tool for under-resourced languages},
  journal = {Speech Communication},
  volume  = {56},
  pages   = {119--131},
  year    = {2014},
  url     = {https://hdl.handle.net/20.500.12185/279}
}
```