---
datasets:
- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0
- PolyAI/minds14
metrics:
- bleu
- cer
base_model:
- microsoft/Phi-4-multimodal-instruct
language:
- ko
license: mit
tags:
- korean
- stt
- custom_code
- phi
- phi-4-multimodal
---

# Phi-4-multimodal-finetune-ko-speech

This is a model fine-tuned for Korean speech-to-text translation from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on the following datasets:

- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0
- PolyAI/minds14
- A custom dataset of my own (Korean speech sentences that I recorded and transcribed with the Azure Speech-to-Text API). The recordings mix fast and slow speech, with some augmentation applied using [audiomentations](https://github.com/iver56/audiomentations).
30
+
31
+ Total 35K samples. Each sample is a pair of Korean speech and its transcription. Dataset was sampled 16kHz.
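
The base model expects 16 kHz audio, so source audio at other rates needs to be resampled before inference or fine-tuning. A minimal sketch using `scipy.signal.resample_poly` (an illustrative choice, not necessarily the tool used for this dataset):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(audio: np.ndarray, orig_sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Resample a mono waveform to 16 kHz via polyphase filtering."""
    g = gcd(orig_sr, target_sr)
    return resample_poly(audio, target_sr // g, orig_sr // g)

# One second of a 440 Hz tone recorded at 44.1 kHz
sr = 44_100
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)

resampled = to_16k(tone, sr)
print(resampled.shape)  # (16000,)
```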
32
+
33
+ The model was trained on a single A100 80GB GPU for 1 epoch with a batch size of 16 using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct)
34
+
35
+ Note that this model is just a PoC/experimental purpose, and not intended to be used in production.
36
+ Phi-4-multimodal model is strong in multimodal tasks, especially in speech-to-text and high potential in Korean language tasks. Thus if you are interested in Korean speech-to-text task, this model can be a good starting point.

## Evaluation

ASR (Automatic Speech Recognition) on the zeroth-test set, and speech translation on the FLEURS ko <-> en set. The evaluation script is retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).
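
The CER metric listed above is edit (Levenshtein) distance between hypothesis and reference, divided by reference length. A plain-Python sketch of that computation (the `cer` helper here is illustrative, not taken from the linked evaluation script):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Rolling-row dynamic programming for edit distance
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rc != hc)))    # substitution
        prev = cur
    return prev[-1] / len(r)

print(cer("안녕하세요", "안녕하세요"))  # 0.0
print(cer("안녕하세요", "안녕하새요"))  # 1 substitution / 5 chars = 0.2
```

In practice, libraries such as `jiwer` provide the same metric with text normalization options.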