Update README.md
- phi-4-multimodal
---

# Phi-4-multimodal-finetune-ko-speech

This model was fine-tuned for Korean speech-to-text translation from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on the following datasets:

Script is retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).

Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor), ASR improves significantly thanks to more high-quality voice data and my own voice. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate data should be mixed in between to mitigate catastrophic forgetting.

| Model                    | zeroth-test | fleurs-ko2en | fleurs-ko2en-cot | fleurs-en2ko | fleurs-en2ko-cot |
|--------------------------|-------------|--------------|------------------|--------------|------------------|
| finetune (this model)    | 3.80        | 7.03         | 7.04             | 12.50        | 9.54             |
| Phi-4-mm-inst-zeroth-kor | 7.02        | 7.07         | 9.19             | 13.08        | 9.35             |

## Usage

### Requirements

Works with the following packages. Please make sure to install them before using the model.

```
flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2
```
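Before running the sample code, it can help to confirm the pinned versions are actually installed. A small sketch using only the standard library (`importlib.metadata`); the `version_mismatches` helper and the subset of pins shown are my own:

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the pins listed above, for illustration.
PINNED = {
    "torch": "2.6.0",
    "transformers": "4.48.2",
    "accelerate": "1.3.0",
    "peft": "0.13.2",
}

def version_mismatches(pins: dict) -> dict:
    """Return {package: installed_version_or_'missing'} for every pin that does not match."""
    bad = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = "missing"
        if installed != wanted:
            bad[name] = installed
    return bad

if __name__ == "__main__":
    for name, got in version_mismatches(PINNED).items():
        print(f"{name}: expected {PINNED[name]}, found {got}")
```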

### Sample code

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

max_new_tokens = 256
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"

generation_config = GenerationConfig.from_pretrained(ft_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ft_model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Task prompts are from the technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'

asr_ds = load_dataset("kresnik/zeroth_korean", split="test")

# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)  # "몬토 킬은 자녀들이 사랑을 제대로 못 받고 크면 매우 심각한 결과가 초래된다는 결론을 내렸습니다"
```
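The CoT prompts above ask the model to emit `transcript <sep> translation` in one string. A small helper to split that output back into its two parts (the function name and the example string are hypothetical, not from the model card):

```python
def split_cot_output(response: str, sep: str = "<sep>"):
    # Split "transcript <sep> translation" into (transcript, translation).
    # If the separator is missing, treat the whole string as the translation.
    if sep in response:
        transcript, translation = response.split(sep, 1)
        return transcript.strip(), translation.strip()
    return "", response.strip()

# Hypothetical CoT model output:
example = "좋은 아침입니다 <sep> Good morning."
transcript, translation = split_cot_output(example)
```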

## References

- https://huggingface.co/microsoft/Phi-4-multimodal-instruct