---
datasets:
- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0
- PolyAI/minds14
metrics:
- bleu
- cer
base_model:
- microsoft/Phi-4-multimodal-instruct
language:
- ko
license: mit
tags:
- korean
- stt
- custom_code
- phi
- phi-4-multimodal
model-index:
- name: Phi-4-mm-inst-zeroth-kor
results:
- task:
type: speech-to-text-translation
dataset:
name: fleurs (ko-en test intersection)
type: seastar105/fleurs_ko_en_test
metrics:
- type: bleu
value: 7.03
name: ko2en
- type: bleu
value: 7.04
name: ko2en-cot
- type: bleu
value: 12.5
name: en2ko (ko-mecab)
- type: bleu
value: 9.54
name: en2ko-cot (ko-mecab)
- task:
type: automatic-speech-recognition
dataset:
name: zeroth_korean test
type: kresnik/zeroth_korean
metrics:
- type: cer
value: 7.02
name: test CER
---
# Phi-4-multimodal-finetune-ko-speech
This model was fine-tuned for Korean speech-to-text (ASR and speech translation) from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on the following datasets:
- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0 (Korean speech only)
- PolyAI/minds14 (Korean speech only)
- A custom dataset of my own. The speech is a mix of fast and slow speech (technical blog content and presentations I have posted), with some augmentation applied using [audiomentations](https://github.com/iver56/audiomentations) and [this script](https://github.com/daekeun-ml/azure-genai-utils/blob/main/azure_genai_utils/stt/augment.py); a brief augmentation sketch is shown below.

There are 35K samples in total. Each sample is a pair of Korean speech and its transcription, sampled at 16 kHz.
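
For reference, below is a minimal sketch of the kind of waveform augmentation audiomentations supports. The transforms and parameters here are illustrative assumptions, not the exact settings used to build the dataset.

```python
# Illustrative augmentation pipeline (assumed settings, not the exact ones used for this dataset).
import numpy as np
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),  # light background noise
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),                     # slightly faster/slower speech
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),               # small pitch variation
])

# 1 second of dummy 16 kHz mono audio; replace with a real waveform loaded via soundfile/librosa.
samples = np.random.uniform(low=-0.3, high=0.3, size=16000).astype(np.float32)
augmented = augment(samples=samples, sample_rate=16000)
```
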
The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16, using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

Note that this model is a PoC/experiment only and is not intended for production use. More high-quality data, tuning, ablation studies, and experiments are needed.

Phi-4-multimodal is strong at multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.
## Evaluation
Evaluation was done on the following datasets:
- ASR (Automatic Speech Recognition): Evaluated with CER (Character Error Rate) on zeroth-test set (457 samples).
- AST (Automatic Speech Translation): Evaluated with BLEU score on the fleurs ko <-> en speech translation results (270 samples).
The evaluation script was retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py); a minimal sketch of the metric computation is also shown after the results table.

Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor), ASR is significantly improved thanks to more high-quality voice data, including my own voice. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate AST data should be mixed into training to mitigate catastrophic forgetting.
| Model | zeroth-test (CER ↓) | fleurs-ko2en (BLEU ↑) | fleurs-ko2en-cot (BLEU ↑) | fleurs-en2ko (BLEU ↑) | fleurs-en2ko-cot (BLEU ↑) |
|----------------------|-------------|--------------|------------------|--------------|------------------|
| original | 198.32 | 5.63 | 2.42 | 6.86 | 4.17 |
| finetune (4 epochs) | 2.72 | 7.11 | 9.95 | 13.22 | 10.45 |
| finetune (1 epoch) | 3.80 | 7.03 | 7.04 | 12.50 | 9.54 |
| Phi-4-mm-inst-zeroth-kor | 7.02 | 7.07 | 9.19 | 13.08 | 9.35 |
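
For reference, the reported numbers roughly correspond to the following computation with the Hugging Face `evaluate` library. This is only a sketch with dummy strings; `predictions` and `references` stand in for the outputs collected by the linked evaluation script.

```python
# Sketch of the metric computation; `predictions`/`references` are placeholders.
import evaluate

cer_metric = evaluate.load("cer")         # character error rate, used for the zeroth-test column
bleu_metric = evaluate.load("sacrebleu")  # corpus BLEU, used for the fleurs columns

predictions = ["안녕하세요 반갑습니다"]
references = ["안녕하세요 반갑습니다"]

cer = cer_metric.compute(predictions=predictions, references=references)
# For en2ko, the scores above are computed with the ko-mecab tokenizer (tokenize="ko-mecab").
bleu = bleu_metric.compute(predictions=predictions, references=[[r] for r in references])
print(f"CER: {cer * 100:.2f}, BLEU: {bleu['score']:.2f}")
```
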
## Usage
### Requirements
The model works with the following packages. Please make sure they are installed before using the model.
```
flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.4.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.14.0
datasets==3.3.2
librosa==0.10.2.post1
pandas==2.2.3
```
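For example, you can save the list above as `requirements.txt` and install everything with `pip install -r requirements.txt`.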
### Sample code
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
max_new_tokens = 256
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"
generation_config = GenerationConfig.from_pretrained(ft_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)  # expected: "몬테 킬은 자녀들이 사랑을 제대로 못 받고 크면 매우 심각한 결과가 초래된다는 결론을 내렸습니다"
```
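
The speech-translation prompts defined above can be used in the same way. Below is a sketch of ko -> en translation with the CoT prompt, reusing `model`, `processor`, and `generation_config` from the snippet above. The `google/fleurs` `ko_kr` test split is used here only for illustration and is not the exact intersection set used in the evaluation.

```python
# AST (ko -> en) with the CoT prompt; reuses model/processor/generation_config from the ASR example.
ast_ds = load_dataset("google/fleurs", "ko_kr", split="test")

item = ast_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=ast_cot_en_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# The CoT prompt returns "<Korean transcript> <sep> <English translation>".
transcript, _, translation = response.partition("<sep>")
print(transcript.strip())
print(translation.strip())
```
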
### Demos
Please refer to the Jupyter notebook and video clips in the [demo folder](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech/tree/main/demos). They are not production quality, since the model was fine-tuned only for PoC purposes, but you can see that it transcribes and translates with high accuracy even when a native speaker speaks quite quickly.
## References
- https://huggingface.co/microsoft/Phi-4-multimodal-instruct
- https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor |