---
license: apache-2.0
datasets:
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- doof-ferb/infore2_audiobooks
- quocanh34/viet_vlsp
- linhtran92/final_dataset_500hrs_wer0
- linhtran92/viet_youtube_asr_corpus_v2
- google/fleurs
- mozilla-foundation/common_voice_16_1
- vivos
language: ["vi"]
metrics: ["wer"]
library_name: transformers
base_model: openai/whisper-tiny
pipeline_tag: automatic-speech-recognition
model-index:
- name: doof-ferb/whisper-tiny-vi
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      type: mozilla-foundation/common_voice_16_1
      name: Mozilla CommonVoice (Vietnamese) v16.1
      config: vi
      split: test
    metrics:
    - type: wer
      value: 26.6
      verified: false
  - task:
      type: automatic-speech-recognition
    dataset:
      type: google/fleurs
      name: Google FLEURS (Vietnamese)
      config: vi_vn
      split: test
    metrics:
    - type: wer
      value: 37.1
      verified: false
  - task:
      type: automatic-speech-recognition
    dataset:
      type: vivos
      name: ĐHQG TPHCM VIVOS
      split: test
    metrics:
    - type: wer
      value: 18.7
      verified: false
---

whisper tiny fine-tuned on a large collection of Vietnamese speech datasets

TODO:
- [x] train, then publish checkpoint
- [x] evaluate WER on Common Voice & FLEURS & VIVOS
- [ ] convert to `openai-whisper`, `whisper.cpp`, `faster-whisper`
- [ ] convert to ONNX: to try https://github.com/k2-fsa/sherpa-onnx & https://github.com/zhuzilin/whisper-openvino
- [ ] convert to TensorRT: https://github.com/openai/whisper/discussions/169

training: 21k steps, warm-up 5%, batch size 16×2 (kaggle free T4×2)

WER manually evaluated on the test sets (Vietnamese part only):

| @ `float16` | `CommonVoice v16.1` | `FLEURS` | `VIVOS` |
|---|---|---|---|
| original `whisper-tiny` | >100% | 88.6% | 62.5% |
| this model | 26.6% | 37.1% | 18.7% |

all training + evaluation scripts are in my repo: https://github.com/phineas-pta/fine-tune-whisper-vi

usage example:
```python
import torch
from transformers import pipeline

# load the fine-tuned checkpoint in half precision on the first GPU
PIPE = pipeline(task="automatic-speech-recognition", model="doof-ferb/whisper-tiny-vi", device="cuda:0", torch_dtype=torch.float16)
# force Vietnamese transcription instead of language auto-detection or translation
PIPE_KWARGS = {"language": "vi", "task": "transcribe"}
PIPE("audio.mp3", generate_kwargs=PIPE_KWARGS)["text"]
```
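
for reading the WER table above: a minimal pure-python sketch of how word error rate is usually defined (word-level edit distance divided by the number of reference words). this is only an illustration, not the evaluation script from the repo, which may use a different library and text normalization. the function name `word_error_rate` and the sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Illustrative helper only (hypothetical name, no text normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# one wrong word out of four
print(word_error_rate("xin chào các bạn", "xin chao các bạn"))
# three extra words against a two-word reference
print(word_error_rate("xin chào", "xin chào các bạn ơi"))
```

the second call shows why a score above 100% is possible: insertions count as errors but the denominator is only the reference length, so a model that outputs many extra words can exceed 100%.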