---
language:
  - zh
license: apache-2.0
tags:
  - whisper-event
  - generated_from_trainer
datasets:
  - mozilla-foundation/common_voice_11_0
model-index:
  - name: Whisper Small zh-HK - Alvin
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: mozilla-foundation/common_voice_11_0 zh-HK
          type: mozilla-foundation/common_voice_11_0
          config: zh-HK
          split: test
          args: zh-HK
        metrics:
          - name: Normalized CER
            type: cer
            value: 10.11
---

# Whisper Small zh-HK - Alvin

This model is a fine-tuned version of openai/whisper-small on the Common Voice 11.0 dataset. Its normalized CER is roughly one percentage point lower than the previous version's.
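
The reported metric is a normalized character error rate (CER). As a minimal sketch of how such a score can be computed with the Hugging Face `evaluate` library (the exact normalization behind the reported number is not specified in this card, so the punctuation/whitespace stripping below is an assumption):

```python
import re
import evaluate

cer_metric = evaluate.load("cer")

def normalize(text: str) -> str:
    # Assumed normalization: drop whitespace and common punctuation
    return re.sub(r"[\s,。，?？!！、.]", "", text)

references = ["你食咗飯未呀?"]
predictions = ["你食咗飯未呀"]
score = cer_metric.compute(
    predictions=[normalize(p) for p in predictions],
    references=[normalize(r) for r in references],
)
print(f"CER: {score:.4f}")
```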

## Training and evaluation data

For training, three datasets were used:

- Common Voice 11 Cantonese (zh-HK) train set (see the loading sketch after this list)
- CantoMap: Winterstein, Grégoire, Tang, Carmen, and Lai, Regine (2020). "CantoMap: a Hong Kong Cantonese MapTask Corpus". In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, pp. 2899-2906.
- Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, and Fung, Pascale (2022). "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". https://arxiv.org/pdf/2201.02419.pdf
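
As a hedged sketch (not part of the original card), the Common Voice train split referenced above can be loaded with the `datasets` library; note that the dataset is gated on the Hub and requires accepting its terms first:

```python
from datasets import load_dataset, Audio

# Load the zh-HK train split of Common Voice 11
cv_train = load_dataset(
    "mozilla-foundation/common_voice_11_0", "zh-HK", split="train"
)
# Resample to 16 kHz, the sampling rate Whisper expects
cv_train = cv_train.cast_column("audio", Audio(sampling_rate=16000))
print(cv_train[0]["sentence"])
```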

## Using the Model

```python
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the audio resampled to 16 kHz, the sampling rate Whisper expects
y, sr = librosa.load('audio.mp3', sr=16000)

MODEL_NAME = "alvanlii/whisper-small-cantonese"

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Decode without the default forced decoder prompt or token suppression
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.config.use_cache = False

# Convert the waveform to log-mel input features and generate token IDs
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
gout = model.generate(
    input_features=processed_in.input_features,
    output_scores=True, return_dict_in_generate=True
)
# Decode the generated token IDs back into text
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)
```
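
Setting `forced_decoder_ids` to `None` and clearing `suppress_tokens` removes the default decoder prompt and token suppression, so the language is auto-detected; the pipeline example below shows how to force Chinese transcription instead.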
Alternatively, you can use Hugging Face pipelines:
```python
from transformers import pipeline
import torch

MODEL_NAME = "alvanlii/whisper-small-cantonese"
lang = "zh"
# Use the first GPU if available, otherwise the CPU (-1)
device = 0 if torch.cuda.is_available() else -1

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,  # split long audio into 30-second chunks
    device=device,
)
# Force Chinese transcription instead of language auto-detection
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
text = pipe("audio.mp3")["text"]
print(text)
```
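
The pipeline accepts a local file path, a URL, or a raw waveform array, and `chunk_length_s=30` lets it transcribe recordings longer than Whisper's 30-second input window.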

## Training Hyperparameters

- learning_rate: 5e-5
- train_batch_size: 25 (on each of 2 GPUs)
- eval_batch_size: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 100 (25 × 2 GPUs × 2 accumulation steps)
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 14000
- mixed_precision_training: Native AMP
- augmentation: SpecAugment
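
The training script itself is not included in this card; as a loose sketch under stated assumptions, the listed values map onto `Seq2SeqTrainingArguments` roughly as follows (argument names, `output_dir`, and treating eval_batch_size as per-device are illustrative assumptions):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-cantonese",  # hypothetical path
    learning_rate=5e-5,
    per_device_train_batch_size=25,          # on each of 2 GPUs
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,           # effective batch size 25 x 2 x 2 = 100
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=14000,
    fp16=True,                               # native AMP mixed precision
)
```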

## Training Results

| Training Loss | Epoch | Step  | Validation Loss | Normalized CER |
|---------------|-------|-------|-----------------|----------------|
| 0.4610        | 0.55  | 2000  | 0.3106          | 13.08          |
| 0.3441        | 1.11  | 4000  | 0.2875          | 11.79          |
| 0.3466        | 1.66  | 6000  | 0.2820          | 11.44          |
| 0.2539        | 2.22  | 8000  | 0.2777          | 10.59          |
| 0.2312        | 2.77  | 10000 | 0.2822          | 10.60          |
| 0.1639        | 3.32  | 12000 | 0.2859          | 10.17          |
| 0.1569        | 3.88  | 14000 | 0.2866          | 10.11          |