This model was fine-tuned from microsoft/Phi-4-multimodal-instruct on Bingsu/zeroth-korean and google/fleurs for 5 epochs.

The model was trained for 960 steps on an H100 GPU on Korean automatic speech recognition (ASR) datasets.

Next, we will check whether the model scales through additional training on synthetic Korean data derived from the CoVoST2 dataset.

Evaluation

Evaluation used the following normalizer and metrics:

from whisper_normalizer.basic import BasicTextNormalizer
from evaluate import load

normalizer = BasicTextNormalizer()  # normalize text before scoring
cer_metric = load("cer")            # character error rate
wer_metric = load("wer")            # word error rate
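BasicTextNormalizer lowercases text and strips punctuation before scoring, so superficial formatting differences are not counted as errors. A rough pure-Python approximation (an illustrative sketch, not the library's exact behavior) looks like:

```python
import re
import unicodedata

def basic_normalize(text: str) -> str:
    """Rough approximation of whisper_normalizer's BasicTextNormalizer:
    lowercase, drop punctuation/symbol characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text.lower())
    # Remove characters whose Unicode category is punctuation (P*) or symbol (S*)
    text = "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(("P", "S"))
    )
    return re.sub(r"\s+", " ", text).strip()
```

For example, `basic_normalize("Hello, World!")` yields `"hello world"`, and Hangul text passes through with only punctuation removed.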
| Model | zeroth-test-BLEU | zeroth-test-CER | zeroth-test-WER | fleurs-test-BLEU | fleurs-test-CER | fleurs-test-WER |
|---|---|---|---|---|---|---|
| original | 0.071 | 126.4 | 121.5 | 0.010 | 115.7 | 112.8 |
| finetune (this model) | 94.837 | 1.429 | 2.951 | 67.659 | 7.951 | 18.313 |

Evaluation was done on the following datasets:

  • ASR (Automatic Speech Recognition): Evaluated with CER (Character Error Rate) on zeroth-test set (457 samples).
  • AST (Automatic Speech Translation): Evaluated with BLEU score on fleurs ko <-> en speech translation result (270 samples).
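CER and WER are both edit-distance rates: the Levenshtein distance between hypothesis and reference, divided by the reference length, at the character or word level respectively. A minimal sketch (illustrative only; the actual evaluation uses the `evaluate` library):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds the diagonal (previous row, previous column) value
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word-level edits / number of reference words."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())
```

Note that both rates can exceed 100% when the hypothesis is much longer than the reference, which is why the untuned model's scores above go past 100.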

The evaluation script was retrieved from here.

Compared to Phi-4-mm-inst-zeroth-kor and Phi-4-multimodal-finetune-ko-speech, ASR performance is significantly improved.

| Model | zeroth-test (CER) | fleurs-ko2en (BLEU) | fleurs-ko2en-cot (BLEU) | fleurs-en2ko (BLEU) | fleurs-en2ko-cot (BLEU) |
|---|---|---|---|---|---|
| original | 198.32 | 5.63 | 2.42 | 6.86 | 4.17 |
| finetune (this model) | 1.31 | 7.46 | 6.24 | 12.15 | 8.91 |
| daekeun-ml/Phi-4-multimodal-finetune-ko-speech | 3.80 | 7.03 | 7.04 | 12.50 | 9.54 |
| seastar105/Phi-4-mm-inst-zeroth-kor | 7.02 | 7.07 | 9.19 | 13.08 | 9.35 |