---
datasets:
- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0
- PolyAI/minds14
metrics:
- bleu
- cer
base_model:
- microsoft/Phi-4-multimodal-instruct
language:
- ko
license: mit
tags:
- korean
- stt
- custom_code
- phi
- phi-4-multimodal
model-index:
- name: Phi-4-mm-inst-zeroth-kor
  results:
  - task:
      type: speech-to-text-translation
    dataset:
      name: fleurs (ko-en test intersection)
      type: seastar105/fleurs_ko_en_test
    metrics:
    - type: bleu
      value: 7.03
      name: ko2en
    - type: bleu
      value: 7.04
      name: ko2en-cot
    - type: bleu
      value: 12.5
      name: en2ko (ko-mecab)
    - type: bleu
      value: 9.54
      name: en2ko-cot (ko-mecab)
  - task:
      type: automatic-speech-recognition
    dataset:
      name: zeroth_korean test
      type: kresnik/zeroth_korean
    metrics:
    - type: cer
      value: 7.02
      name: test CER
---

# Phi-4-multimodal-finetune-ko-speech

This model is fine-tuned for Korean speech-to-text (transcription and translation) from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on the following datasets:

- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0 (Korean speech only)
- PolyAI/minds14 (Korean speech only)
- A custom dataset of my own. The speech is a mix of fast and slow speech (technical blog content and presentations I have posted), with some augmentation applied using [audiomentations](https://github.com/iver56/audiomentations) and [this script](https://github.com/daekeun-ml/azure-genai-utils/blob/main/azure_genai_utils/stt/augment.py); a sketch of this kind of augmentation follows below.

In total there are 35K samples; each sample is a pair of Korean speech and its transcription, and all audio is sampled at 16 kHz.
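
As a rough illustration of that augmentation step, here is a minimal sketch using audiomentations; the file name, probabilities, and parameter ranges are assumptions for illustration, not the exact settings used for this model:

```python
import librosa
import soundfile as sf
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

# Illustrative pipeline: noise, tempo, and pitch perturbations.
# The probabilities and ranges below are assumptions, not the exact
# settings used for this model.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
])

# Load at 16 kHz to match the training data, then augment and save.
samples, sr = librosa.load("my_clip.wav", sr=16000, mono=True)
augmented = augment(samples=samples, sample_rate=sr)
sf.write("my_clip_augmented.wav", augmented, sr)
```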

The model was trained on a single A100 80 GB GPU for 4 epochs with a batch size of 16, using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

Note that this model is intended for PoC/experimental purposes only, not for production use. More high-quality data, tuning, ablation studies, and experiments would be needed.

The Phi-4-multimodal model is strong at multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.

## Evaluation

Evaluation was done on the following datasets:
- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth-test set (457 samples).
- AST (Automatic Speech Translation): evaluated with BLEU score on fleurs ko <-> en speech translation results (270 samples).

The evaluation script was taken from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).
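
To get a feel for the metrics without the full script, a minimal sketch using the Hugging Face `evaluate` library (extra dependencies: `evaluate`, `jiwer` for CER, `sacrebleu` for BLEU; none are in the requirements list below) could look like this, with placeholder strings standing in for real outputs:

```python
import evaluate

# Placeholder data: model outputs vs. ground-truth text.
predictions = ["안녕하세요 반갑습니다"]
references = ["안녕하세요, 반갑습니다"]

# CER for ASR (lower is better); compute() returns a fraction, so scale to %.
cer_metric = evaluate.load("cer")
cer = 100 * cer_metric.compute(predictions=predictions, references=references)

# Corpus BLEU for AST (higher is better). The en2ko scores in the table
# below use ko-mecab tokenization (pass tokenize="ko-mecab", which
# requires the mecab-ko tokenizer to be installed).
bleu_metric = evaluate.load("sacrebleu")
bleu = bleu_metric.compute(
    predictions=predictions, references=[[r] for r in references]
)["score"]

print(f"CER: {cer:.2f}%  BLEU: {bleu:.2f}")
```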

Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor), ASR improves significantly thanks to more high-quality voice data, including my own voice. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate data should be mixed in during training to mitigate catastrophic forgetting.

| Model                    | zeroth-test (CER, ↓) | fleurs-ko2en (BLEU, ↑) | fleurs-ko2en-cot (BLEU, ↑) | fleurs-en2ko (BLEU, ↑) | fleurs-en2ko-cot (BLEU, ↑) |
|--------------------------|----------------------|------------------------|----------------------------|------------------------|----------------------------|
| original                 | 198.32               | 5.63                   | 2.42                       | 6.86                   | 4.17                       |
| finetune (4 epochs)      | 2.72                 | 7.11                   | 9.95                       | 13.22                  | 10.45                      |
| finetune (1 epoch)       | 3.80                 | 7.03                   | 7.04                       | 12.50                  | 9.54                       |
| Phi-4-mm-inst-zeroth-kor | 7.02                 | 7.07                   | 9.19                       | 13.08                  | 9.35                       |

## Usage

### Requirements

The model works with the following package versions. Please make sure they are installed before using the model.
```
flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.4.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.14.0
datasets==3.3.2
librosa==0.10.2.post1
pandas==2.2.3
```
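
Assuming the list above is saved as `requirements.txt`, everything can be installed in one step:

```
pip install -r requirements.txt
```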

### Sample code
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

max_new_tokens = 256
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"
generation_config = GenerationConfig.from_pretrained(ft_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ft_model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# task prompts are from the technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'

asr_ds = load_dataset("kresnik/zeroth_korean", split="test")

# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)  # "몬토 킬은 자녀들이 사랑을 제대로 못 받고 크면 매우 심각한 결과가 초래된다는 결론을 내렸습니다"
```
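
The AST prompts defined above work the same way. For example, here is a sketch reusing the same `item` (Korean speech) with the chain-of-thought translation prompt, assuming the model emits exactly one `<sep>` separator:

```python
# AST with chain-of-thought: the response contains the Korean transcript
# and the English translation, separated by <sep>.
inputs = processor(text=ast_cot_en_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# Assumes exactly one <sep> in the output; guard accordingly in real use.
transcript, translation = (s.strip() for s in response.split("<sep>", 1))
print(translation)
```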

### Demos
Please refer to the Jupyter notebook and video clips in the [demo folder](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech/tree/main/demos). They are not production-quality, as the model was fine-tuned only for PoC purposes, but you can see that it transcribes and translates with high accuracy even when a native speaker speaks quite quickly.

## References

- https://huggingface.co/microsoft/Phi-4-multimodal-instruct
- https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor