---
datasets:
- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0
- PolyAI/minds14
metrics:
- bleu
- cer
base_model:
- microsoft/Phi-4-multimodal-instruct
language:
- ko
license: mit
tags:
- korean
- stt
- custom_code
- phi
- phi-4-multimodal
model-index:
- name: Phi-4-mm-inst-zeroth-kor
  results:
  - task:
      type: speech-to-text-translation
    dataset:
      name: fleurs (ko-en test intersection)
      type: seastar105/fleurs_ko_en_test
    metrics:
    - type: bleu
      value: 7.03
      name: ko2en
    - type: bleu
      value: 7.04
      name: ko2en-cot
    - type: bleu
      value: 12.5
      name: en2ko (ko-mecab)
    - type: bleu
      value: 9.54
      name: en2ko-cot (ko-mecab)
  - task:
      type: automatic-speech-recognition
    dataset:
      name: zeroth_korean test
      type: kresnik/zeroth_korean
    metrics:
    - type: cer
      value: 7.02
      name: test CER
---

# Phi-4-multimodal-finetune-ko-speech

This model is fine-tuned from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) for Korean speech-to-text tasks (ASR and speech translation) on the following datasets:

- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0 (Korean speech only)
- PolyAI/minds14 (Korean speech only)
- A custom dataset of my own: a mix of fast and slow speech (technical blog content and presentations I have posted), with some modulation applied using [audiomentations](https://github.com/iver56/audiomentations) and [this script](https://github.com/daekeun-ml/azure-genai-utils/blob/main/azure_genai_utils/stt/augment.py); a sketch of the augmentation step follows below.

In total there are 35K samples; each sample is a pair of Korean speech and its transcription, sampled at 16 kHz.
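
For reference, here is a minimal sketch of the kind of augmentation pipeline described above, using audiomentations and librosa. The transform parameters and file names are illustrative assumptions, not the exact settings used (those live in the linked script):

```python
import librosa
import soundfile as sf
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

# Illustrative augmentation chain: noise, tempo, and pitch perturbations.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
])

# Load a recording at the 16 kHz rate used for training, augment, and save.
samples, sr = librosa.load("recording.wav", sr=16000)  # hypothetical input file
augmented = augment(samples=samples, sample_rate=sr)
sf.write("recording_augmented.wav", augmented, sr)
```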

The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16, using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

Note that this model is intended only as a PoC/experiment, not for production use. More high-quality data, tuning, ablation studies, and experiments are needed.

Phi-4-multimodal is strong at multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.

## Evaluation

Evaluation was done on the following datasets:

- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth-test set (457 samples).
- AST (Automatic Speech Translation): evaluated with BLEU score on the fleurs ko <-> en speech translation sets (270 samples).

The evaluation script was retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).
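
For illustration only, the two metrics can be computed with the Hugging Face `evaluate` library as sketched below. This is a toy example under assumptions, not the linked script: `jiwer` and `sacrebleu` are additional dependencies, and the strings are placeholders.

```python
import evaluate

cer_metric = evaluate.load("cer")         # character error rate, for ASR
bleu_metric = evaluate.load("sacrebleu")  # corpus BLEU, for AST

predictions = ["모델이 출력한 전사"]  # model outputs (placeholder)
references = ["정답 전사"]            # ground-truth transcripts (placeholder)

cer = cer_metric.compute(predictions=predictions, references=references)
# The en2ko scores in the metadata use Korean mecab tokenization
# (sacrebleu's tokenize="ko-mecab", which requires mecab-ko).
bleu = bleu_metric.compute(predictions=predictions,
                           references=[[ref] for ref in references])
print(f"CER: {cer:.4f}  BLEU: {bleu['score']:.2f}")
```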

Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor), ASR is significantly improved thanks to more high-quality voice data, including my own voice. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate data should be mixed into training to mitigate catastrophic forgetting.

| Model | zeroth-test (CER ↓) | fleurs-ko2en (BLEU ↑) | fleurs-ko2en-cot (BLEU ↑) | fleurs-en2ko (BLEU ↑) | fleurs-en2ko-cot (BLEU ↑) |
|--------------------------|--------|------|------|-------|-------|
| original                 | 198.32 | 5.63 | 2.42 | 6.86  | 4.17  |
| finetune (4 epochs)      | 2.72   | 7.11 | 9.95 | 13.22 | 10.45 |
| finetune (1 epoch)       | 3.80   | 7.03 | 7.04 | 12.50 | 9.54  |
| Phi-4-mm-inst-zeroth-kor | 7.02   | 7.07 | 9.19 | 13.08 | 9.35  |

## Usage

### Requirements

The model works with the following package versions; please make sure they are installed before running the sample code.

```
flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.4.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.14.0
datasets==3.3.2
librosa==0.10.2.post1
pandas==2.2.3
```
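
For example, the pinned versions can be installed with pip (torch first, since building flash_attn requires it):

```
pip install torch==2.6.0
pip install flash_attn==2.7.4.post1 transformers==4.48.2 accelerate==1.4.0 \
    soundfile==0.13.1 pillow==11.1.0 scipy==1.15.2 torchvision==0.21.0 \
    backoff==2.2.1 peft==0.14.0 datasets==3.3.2 librosa==0.10.2.post1 pandas==2.2.3
```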

### Sample code

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

max_new_tokens = 256
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"

generation_config = GenerationConfig.from_pretrained(ft_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ft_model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# task prompts are from the technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'

asr_ds = load_dataset("kresnik/zeroth_korean", split="test")

# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)  # prints the Korean transcription of the sample utterance
```
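
The AST prompts defined above are used in exactly the same way; only the prompt text changes. For example, Korean-to-English translation with the CoT prompt, reusing `audio` from the snippet above (per the prompt, the output should contain the transcript and translation separated by `<sep>`):

```python
# AST (ko -> en, CoT): the model first transcribes, then translates.
inputs = processor(text=ast_cot_en_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)  # expected: "<Korean transcript> <sep> <English translation>"
```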

### Demos

Please refer to the Jupyter notebook and video clips in the [demo folder](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech/tree/main/demos). The results are not production-quality, since the model was fine-tuned only for PoC purposes, but you can see that it transcribes and translates with high accuracy even when a native speaker speaks quite fast.

## References

- https://huggingface.co/microsoft/Phi-4-multimodal-instruct
- https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor