---
language:
  - xh
pipeline_tag: automatic-speech-recognition
tags:
  - audio
  - automatic-speech-recognition
widget:
  - example_title: Librispeech sample 1
    src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
  - example_title: Librispeech sample 2
    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
datasets:
  - Beijuka/xhosa_parakeet_50hr
metrics:
  - wer
base_model: openai/whisper-small
license: apache-2.0
---

# Whisper-Small Fine-tuned for isiXhosa ASR

## Model Description

This model is a fine-tuned version of OpenAI's Whisper-small, adapted for isiXhosa Automatic Speech Recognition (ASR). It was trained on the NCHLT isiXhosa Speech Corpus to improve transcription accuracy on isiXhosa speech.

## Performance

- **Word Error Rate (WER):** 32% (see the evaluation sketch below)
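
A WER figure like this can be computed with the Hugging Face `evaluate` library. The snippet below is only a minimal sketch, not the evaluation script used for this model, and the reference/prediction strings are hypothetical placeholders:

```python
import evaluate  # pip install evaluate jiwer

# Load the WER metric (backed by jiwer)
wer_metric = evaluate.load("wer")

# Hypothetical placeholder transcripts; in practice, use the test-set
# references and the transcriptions generated by the model.
references = ["molo unjani"]
predictions = ["molo unjani"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2%}")
```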

## Base Model

- **Name:** openai/whisper-small
- **Type:** Automatic Speech Recognition (ASR)
- **Original language coverage:** Multilingual

## Usage

To use this model for inference:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("TheirStory-Inc/whisper-small-xhosa")
processor = WhisperProcessor.from_pretrained("TheirStory-Inc/whisper-small-xhosa")

# Prepare your audio file (16kHz sampling rate)
audio_input = ...  # Load your audio file here

# Process the audio
input_features = processor(audio_input, sampling_rate=16000, return_tensors="pt").input_features

# Generate token ids
predicted_ids = model.generate(input_features)

# Decode the token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(transcription)
```
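
For quick experimentation, the high-level `pipeline` API can also be used; it handles audio decoding and resampling to 16 kHz internally. This is a minimal sketch in which `audio.wav` is a placeholder path and `ffmpeg` is assumed to be available for decoding:

```python
from transformers import pipeline

# High-level ASR pipeline; downloads the model and processor automatically
asr = pipeline(
    "automatic-speech-recognition",
    model="TheirStory-Inc/whisper-small-xhosa",
)

# "audio.wav" is a placeholder path; the pipeline resamples to 16 kHz internally
result = asr("audio.wav")
print(result["text"])
```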

## Fine-tuning Dataset

- **Name:** NCHLT isiXhosa Speech Corpus
- **Size:** Approximately 56 hours of transcribed speech
- **Speakers:** 209 (106 female, 103 male)
- **Content:** Prompted speech (3-5 word utterances read from a smartphone screen)
- **Source:** Audio recorded with smartphones in non-studio environments
- **License:** Creative Commons Attribution 3.0 Unported (CC BY 3.0)
De Vries, N. J., Davel, M. H., Badenhorst, J., Basson, W. D., de Wet, F., Barnard, E., & de Waal, A. (2014). A smartphone-based ASR data collection tool for under-resourced languages. *Speech Communication*, 56, 119-131. https://hdl.handle.net/20.500.12185/279