---
datasets:
- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0
- PolyAI/minds14
metrics:
- bleu
- cer
base_model:
- microsoft/Phi-4-multimodal-instruct
language:
- ko
license: mit
tags:
- korean
- stt
- custom_code
- phi
- phi-4-multimodal
model-index:
- name: Phi-4-mm-inst-zeroth-kor
  results:
  - task:
      type: speech-to-text-translation
    dataset:
      name: fleurs (ko-en test intersection)
      type: seastar105/fleurs_ko_en_test
    metrics:
    - type: bleu
      value: 7.03
      name: ko2en
    - type: bleu
      value: 7.04
      name: ko2en-cot
    - type: bleu
      value: 12.5
      name: en2ko (ko-mecab)
    - type: bleu
      value: 9.54
      name: en2ko-cot (ko-mecab)
  - task:
      type: automatic-speech-recognition
    dataset:
      name: zeroth_korean test
      type: kresnik/zeroth_korean
    metrics:
    - type: cer
      value: 7.02
      name: test CER
---

# Phi-4-multimodal-finetune-ko-speech

This model is fine-tuned from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) for Korean speech-to-text tasks (ASR and speech translation) on the following datasets:

- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0 (Korean speech only)
- PolyAI/minds14 (Korean speech only)
- A custom dataset of my own: a mix of fast and slow speech (technical blog content and presentations I have posted), with some modulation applied using [audiomentations](https://github.com/iver56/audiomentations) and [this script](https://github.com/daekeun-ml/azure-genai-utils/blob/main/azure_genai_utils/stt/augment.py); a sketch of the augmentation step follows below.

In total there are 35K samples; each sample is a pair of Korean speech and its transcription, sampled at 16 kHz.
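
For reference, here is a minimal sketch of the kind of augmentation pipeline described above, using audiomentations and librosa. The transform parameters and file names are illustrative assumptions, not the exact settings used (those live in the linked script):

```python
import librosa
import soundfile as sf
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

# Illustrative augmentation chain: noise, tempo, and pitch perturbations.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
])

# Load a recording at the 16 kHz rate used for training, augment, and save.
samples, sr = librosa.load("recording.wav", sr=16000)  # hypothetical input file
augmented = augment(samples=samples, sample_rate=sr)
sf.write("recording_augmented.wav", augmented, sr)
```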

The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16, using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

Note that this model is intended only as a PoC/experiment, not for production use. More high-quality data, tuning, ablation studies, and experiments are needed.

Phi-4-multimodal is strong at multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.

## Evaluation

Evaluation was done on the following datasets:

- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth-test set (457 samples).
- AST (Automatic Speech Translation): evaluated with BLEU score on the fleurs ko <-> en speech translation sets (270 samples).

The evaluation script was retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).
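
For illustration only, the two metrics can be computed with the Hugging Face `evaluate` library as sketched below. This is a toy example under assumptions, not the linked script: `jiwer` and `sacrebleu` are additional dependencies, and the strings are placeholders.

```python
import evaluate

cer_metric = evaluate.load("cer")         # character error rate, for ASR
bleu_metric = evaluate.load("sacrebleu")  # corpus BLEU, for AST

predictions = ["모델이 출력한 전사"]  # model outputs (placeholder)
references = ["정답 전사"]            # ground-truth transcripts (placeholder)

cer = cer_metric.compute(predictions=predictions, references=references)
# The en2ko scores in the metadata use Korean mecab tokenization
# (sacrebleu's tokenize="ko-mecab", which requires mecab-ko).
bleu = bleu_metric.compute(predictions=predictions,
                           references=[[ref] for ref in references])
print(f"CER: {cer:.4f}  BLEU: {bleu['score']:.2f}")
```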

Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor), ASR is significantly improved thanks to more high-quality voice data, including my own voice. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate data should be mixed into training to mitigate catastrophic forgetting.

| Model | zeroth-test (CER ↓) | fleurs-ko2en (BLEU ↑) | fleurs-ko2en-cot (BLEU ↑) | fleurs-en2ko (BLEU ↑) | fleurs-en2ko-cot (BLEU ↑) |
|--------------------------|--------|------|------|-------|-------|
| original                 | 198.32 | 5.63 | 2.42 | 6.86  | 4.17  |
| finetune (4 epochs)      | 2.72   | 7.11 | 9.95 | 13.22 | 10.45 |
| finetune (1 epoch)       | 3.80   | 7.03 | 7.04 | 12.50 | 9.54  |
| Phi-4-mm-inst-zeroth-kor | 7.02   | 7.07 | 9.19 | 13.08 | 9.35  |

## Usage

### Requirements

The model works with the following package versions; please make sure they are installed before running the sample code.

```
flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.4.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.14.0
datasets==3.3.2
librosa==0.10.2.post1
pandas==2.2.3
```
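
For example, the pinned versions can be installed with pip (torch first, since building flash_attn requires it):

```
pip install torch==2.6.0
pip install flash_attn==2.7.4.post1 transformers==4.48.2 accelerate==1.4.0 \
    soundfile==0.13.1 pillow==11.1.0 scipy==1.15.2 torchvision==0.21.0 \
    backoff==2.2.1 peft==0.14.0 datasets==3.3.2 librosa==0.10.2.post1 pandas==2.2.3
```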

### Sample code

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

max_new_tokens = 256
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"

generation_config = GenerationConfig.from_pretrained(ft_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ft_model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# task prompts are from the technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'

asr_ds = load_dataset("kresnik/zeroth_korean", split="test")

# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)  # prints the Korean transcription of the sample utterance
```
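
The AST prompts defined above are used in exactly the same way; only the prompt text changes. For example, Korean-to-English translation with the CoT prompt, reusing `audio` from the snippet above (per the prompt, the output should contain the transcript and translation separated by `<sep>`):

```python
# AST (ko -> en, CoT): the model first transcribes, then translates.
inputs = processor(text=ast_cot_en_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)  # expected: "<Korean transcript> <sep> <English translation>"
```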

### Demos

Please refer to the Jupyter notebook and video clips in the [demo folder](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech/tree/main/demos). The results are not production-quality, since the model was fine-tuned only for PoC purposes, but you can see that it transcribes and translates with high accuracy even when a native speaker speaks quite fast.

## References

- https://huggingface.co/microsoft/Phi-4-multimodal-instruct
- https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor