Update README.md
- phi-4-multimodal
---

# Phi-4-multimodal-finetune-ko-speech

This model was fine-tuned for Korean speech-to-text translation from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on the following datasets:

Script is retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).

Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor), ASR improves significantly thanks to more high-quality voice data and my own voice. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate data should be mixed in between to mitigate catastrophic forgetting.

| Model                    | zeroth-test | fleurs-ko2en | fleurs-ko2en-cot | fleurs-en2ko | fleurs-en2ko-cot |
|--------------------------|-------------|--------------|------------------|--------------|------------------|
| finetune (this model)    | 3.80        | 7.03         | 7.04             | 12.50        | 9.54             |
| Phi-4-mm-inst-zeroth-kor | 7.02        | 7.07         | 9.19             | 13.08        | 9.35             |

## Usage

### Requirements

Works with the following packages. Please make sure to install them before using the model.

```
flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2
```
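Before running the sample code, it can help to confirm the pinned versions are actually installed. A small sketch using only the standard library (`importlib.metadata`); the `version_mismatches` helper and the subset of pins shown are my own:

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the pins listed above, for illustration.
PINNED = {
    "torch": "2.6.0",
    "transformers": "4.48.2",
    "accelerate": "1.3.0",
    "peft": "0.13.2",
}

def version_mismatches(pins: dict) -> dict:
    """Return {package: installed_version_or_'missing'} for every pin that does not match."""
    bad = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = "missing"
        if installed != wanted:
            bad[name] = installed
    return bad

if __name__ == "__main__":
    for name, got in version_mismatches(PINNED).items():
        print(f"{name}: expected {PINNED[name]}, found {got}")
```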

### Sample code

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

max_new_tokens = 256
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"

generation_config = GenerationConfig.from_pretrained(ft_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ft_model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Task prompts are from the technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'

asr_ds = load_dataset("kresnik/zeroth_korean", split="test")

# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)  # "몬토 킬은 자녀들이 사랑을 제대로 못 받고 크면 매우 심각한 결과가 초래된다는 결론을 내렸습니다"
```
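The CoT prompts above ask the model to emit `transcript <sep> translation` in one string. A small helper to split that output back into its two parts (the function name and the example string are hypothetical, not from the model card):

```python
def split_cot_output(response: str, sep: str = "<sep>"):
    # Split "transcript <sep> translation" into (transcript, translation).
    # If the separator is missing, treat the whole string as the translation.
    if sep in response:
        transcript, translation = response.split(sep, 1)
        return transcript.strip(), translation.strip()
    return "", response.strip()

# Hypothetical CoT model output:
example = "좋은 아침입니다 <sep> Good morning."
transcript, translation = split_cot_output(example)
```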

## References

- https://huggingface.co/microsoft/Phi-4-multimodal-instruct