---
language:
- xh
pipeline_tag: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
datasets:
- Beijuka/xhosa_parakeet_50hr
- wjbmattingly/xhosa_merged_audio
metrics:
- wer
base_model: openai/whisper-small
license: apache-2.0
---

# Whisper-Small Fine-tuned for isiXhosa ASR

## Model Description

This model is a fine-tuned version of OpenAI's Whisper-small, optimized for isiXhosa Automatic Speech Recognition (ASR). It was trained on the NCHLT isiXhosa Speech Corpus to improve transcription of isiXhosa speech.

## Performance

- Word Error Rate (WER): 29.73%

## Base Model

- Name: openai/whisper-small
- Type: Automatic Speech Recognition (ASR)
- Original language: Multilingual

## Usage

To use this model for inference:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("TheirStory-Inc/whisper-small-xhosa")
processor = WhisperProcessor.from_pretrained("TheirStory-Inc/whisper-small-xhosa")

# Prepare your audio (a 1-D array of samples at a 16 kHz sampling rate)
audio_input = ...  # Load your audio file here

# Convert the raw audio to log-mel input features
input_features = processor(audio_input, sampling_rate=16000, return_tensors="pt").input_features

# Generate token ids
predicted_ids = model.generate(input_features)

# Decode the token ids to text (returns a list with one string per input)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(transcription)
```

## Fine-tuning Dataset

- Name: NCHLT isiXhosa Speech Corpus
- Size: Approximately 56 hours of transcribed speech
- Speakers: 209 (106 female, 103 male)
- Content: Prompted speech (3-5 word utterances read from a smartphone screen)
- Source: Audio recorded on smartphones in non-studio environments
- License: Creative Commons Attribution 3.0 Unported License (CC BY 3.0)

### Citation

```tex
De Vries, N.J., Davel, M.H., Badenhorst, J., Basson, W.D., de Wet, F., Barnard, E. and de Waal, A. (2014).
A smartphone-based ASR data collection tool for under-resourced languages. Speech Communication, 56, 119-131.
https://hdl.handle.net/20.500.12185/279
```
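
## Note on the WER Metric

The Performance section reports a word error rate of 29.73%. This card does not state how that figure was computed (evaluation split, text normalization, or tooling), so the following is only a minimal sketch of the standard WER definition: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function name `word_error_rate` and the toy strings are illustrative assumptions, not part of this model's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: no casing or punctuation normalization is applied,
    and this is not necessarily how the score on this card was produced.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Toy example: one substituted word and one dropped word out of four.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Under this definition, a WER of 29.73% corresponds to roughly 30 word-level errors (substitutions, deletions, and insertions combined) per 100 reference words. In practice, scores are usually computed over a whole test set with an established library rather than a hand-written function.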