This model was fine-tuned from microsoft/Phi-4-multimodal-instruct on Bingsu/zeroth-korean and google/fleurs for 5 epochs.

The model was trained for 960 steps on an H100 GPU on Korean automatic speech recognition (ASR) datasets.

Next, we will check whether the model scales through additional training on synthetic Korean data derived from the CoVoST2 dataset.

Evaluation

Evaluation used the following normalizer and metrics:

from whisper_normalizer.basic import BasicTextNormalizer
from evaluate import load

normalizer = BasicTextNormalizer()  # normalize text before scoring
cer_metric = load("cer")            # character error rate
wer_metric = load("wer")            # word error rate
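BasicTextNormalizer lowercases text and strips punctuation before scoring, so superficial formatting differences are not counted as errors. A rough pure-Python approximation (an illustrative sketch, not the library's exact behavior) looks like:

```python
import re
import unicodedata

def basic_normalize(text: str) -> str:
    """Rough approximation of whisper_normalizer's BasicTextNormalizer:
    lowercase, drop punctuation/symbol characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text.lower())
    # Remove characters whose Unicode category is punctuation (P*) or symbol (S*)
    text = "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(("P", "S"))
    )
    return re.sub(r"\s+", " ", text).strip()
```

For example, `basic_normalize("Hello, World!")` yields `"hello world"`, and Hangul text passes through with only punctuation removed.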
| Model | zeroth-test-BLEU | zeroth-test-CER | zeroth-test-WER | fleurs-test-BLEU | fleurs-test-CER | fleurs-test-WER |
|---|---|---|---|---|---|---|
| original | 0.071 | 126.4 | 121.5 | 0.010 | 115.7 | 112.8 |
| finetune (this model) | 94.837 | 1.429 | 2.951 | 67.659 | 7.951 | 18.313 |

Evaluation was done on the following datasets:

  • ASR (Automatic Speech Recognition): Evaluated with CER (Character Error Rate) on zeroth-test set (457 samples).
  • AST (Automatic Speech Translation): Evaluated with BLEU score on fleurs ko <-> en speech translation result (270 samples).
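CER and WER are both edit-distance rates: the Levenshtein distance between hypothesis and reference, divided by the reference length, at the character or word level respectively. A minimal sketch (illustrative only; the actual evaluation uses the `evaluate` library):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds the diagonal (previous row, previous column) value
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word-level edits / number of reference words."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())
```

Note that both rates can exceed 100% when the hypothesis is much longer than the reference, which is why the untuned model's scores above go past 100.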

The evaluation script was retrieved from here.

Compared to Phi-4-mm-inst-zeroth-kor and Phi-4-multimodal-finetune-ko-speech, ASR performance is significantly improved.

| Model | zeroth-test (CER) | fleurs-ko2en (BLEU) | fleurs-ko2en-cot (BLEU) | fleurs-en2ko (BLEU) | fleurs-en2ko-cot (BLEU) |
|---|---|---|---|---|---|
| original | 198.32 | 5.63 | 2.42 | 6.86 | 4.17 |
| finetune (this model) | 1.31 | 7.46 | 6.24 | 12.15 | 8.91 |
| daekeun-ml/Phi-4-multimodal-finetune-ko-speech | 3.80 | 7.03 | 7.04 | 12.50 | 9.54 |
| seastar105/Phi-4-mm-inst-zeroth-kor | 7.02 | 7.07 | 9.19 | 13.08 | 9.35 |