---
language:
- zh
license: apache-2.0
tags:
- whisper-event
datasets:
- mozilla-foundation/common_voice_11_0
model-index:
- name: Whisper Large V2 zh-HK - Alvin
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 zh-HK
      type: mozilla-foundation/common_voice_11_0
      config: zh-HK
      split: test
      args: zh-HK
    metrics:
    - name: Normalized CER
      type: cer
      value: 7.766
metrics:
- cer
pipeline_tag: automatic-speech-recognition
---

# Whisper Large V2 zh-HK - Alvin

This model is a fine-tuned version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) on the Common Voice 11.0 dataset. It was trained with PEFT LoRA and bitsandbytes (BNB) INT8 quantization, and reaches a normalized CER of 7.77% on the Common Voice 11.0 zh-HK test split.

To use the model, run the following code. Inference should fit in less than 4 GB of VRAM at a batch size of 1.

```
from peft import PeftModel, PeftConfig
from transformers import (
    AutomaticSpeechRecognitionPipeline,
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperTokenizer,
)

peft_model_id = "alvanlii/whisper-largev2-cantonese-peft-lora"
peft_config = PeftConfig.from_pretrained(peft_model_id)

# Load the base Whisper model in 8-bit, then attach the LoRA adapter weights.
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)

task = "transcribe"
# Assumption: the original snippet left `language` undefined; "chinese" is
# chosen here to match the `language: zh` metadata of this card.
language = "chinese"

tokenizer = WhisperTokenizer.from_pretrained(peft_config.base_model_name_or_path, task=task)
processor = WhisperProcessor.from_pretrained(peft_config.base_model_name_or_path, task=task)
feature_extractor = processor.feature_extractor
forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task=task)

pipe = AutomaticSpeechRecognitionPipeline(
    model=model, tokenizer=tokenizer, feature_extractor=feature_extractor
)

# Placeholder: replace with the path to your own audio file or a raw waveform.
audio = "audio.wav"
text = pipe(
    audio,
    generate_kwargs={"forced_decoder_ids": forced_decoder_ids},
    max_new_tokens=255,
)["text"]
```

## Training and evaluation data

For training, three datasets were used:
- Common Voice 11.0 Cantonese (zh-HK) train set
- CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, p. 2899-2906.
- Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf

## Training Hyperparameters

- learning_rate: 1e-3
- train_batch_size: 60 (on a single 3090 GPU)
- eval_batch_size: 10
- gradient_accumulation_steps: 1
- total_train_batch_size: 60x1x1=60
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 12000
- augmentation: SpecAugment

## Training Results

| Training Loss | Epoch | Step  | Validation Loss | Normalized CER |
|:-------------:|:-----:|:-----:|:---------------:|:--------------:|
| 0.8604        | 1.99  | 12000 | 0.2129          | 0.07766        |
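
The Normalized CER above is reported as a fraction (0.07766), while the card metadata gives the same value as a percentage (7.766). For readers who want to compute a comparable metric on their own transcripts, the following is a minimal sketch of a corpus-level character error rate. The `normalize` function is a hypothetical stand-in: this card does not state which text normalizer was used, so numbers produced by this sketch may not match the reported score exactly.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Illustrative normalizer (an assumption, not the one used for this card):
    NFKC-fold, lowercase, then drop punctuation, whitespace, and underscores."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"[\W_]", "", text)


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_cer(references, hypotheses) -> float:
    """Corpus-level CER: total character edits / total reference characters."""
    edits = 0
    chars = 0
    for ref, hyp in zip(references, hypotheses):
        ref_n, hyp_n = normalize(ref), normalize(hyp)
        edits += edit_distance(ref_n, hyp_n)
        chars += len(ref_n)
    return edits / chars if chars else 0.0


# Made-up example sentences: one substituted character out of nine.
refs = ["我哋去食飯。", "今日好熱"]
hyps = ["我地去食飯", "今日好熱"]
print(f"{normalized_cer(refs, hyps):.2%}")  # prints 11.11%
```

Summing edits and reference lengths over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the score.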