Whisper large v3 cannot recognize speech after fine-tuning

#51
by bardenthenry

Before Fine-Tune

[0.00s > 18.94s] 大家報告一下上週的進度
[19.20s > 21.50s] 上週主要在PPC
[21.76s > 26.12s] AVAP這邊是用那個AML模型建立生存的
[26.12s > 29.20s] 預測模型來看它的效果
[29.44s > 31.24s] 那一開始就是如上週報告
[31.50s > 34.56s] 有測試的就是不同初始值會對模型的影響
[34.82s > 36.10s] 這邊是使用同一個
[36.36s > 37.90s] 深度的模型來測試
[38.40s > 41.22s] 那測試的結果是明顯的
[41.48s > 45.06s] 初始的權重會對模型的表現性影響很大
[45.58s > 46.86s] 那這邊
[47.12s > 49.68s] 分別就是使用了三種不同的初始權重
[50.18s > 54.54s] 那他們雖然在同一個架構一層的hidden layer下面
[54.80s > 55.30s] 他們的
[55.56s > 56.08s] 表現性
[56.38s > 57.66s] 還是有明顯的不同
[58.94s > 60.48s] 等於說這是什麼專案
[60.72s > 61.76s] 這個是
[62.00s > 64.06s] 這邊是用
[64.56s > 65.84s] PVTC的
[66.10s > 66.62s] 數據
[67.12s > 69.44s] 你現在在研究的這個專案是哪一個
[69.68s > 70.46s] 現在
[70.96s > 72.24s] 這個的專案就是
[72.76s > 77.12s] 因為PVTC跟VAEP都要想要用CVAE的生成方式
[77.62s > 83.26s] 但是因為CVAE那邊生成的數據還是需要一個模型去驗證
[83.52s > 85.04s] 出來它的數據預測準不準
[85.56s > 86.08s] 那目前就是
[86.32s > 87.92s] 生成這邊就先放置然後來
......

After Fine-Tune

[21.76s > 25.86s] ,
[25.86s > 55.86s] 的預測模型來看它的效果 那測試的結果是 明顯的初始的權重會對模型的表現性影響很大 那這邊分別就是使用的三種不同的初始權重 那他們雖戾一層的械類的下面 他們的表現

Training Data Format

# Input labels
[ 50258, 50260, 50359, 50363, 25583, 5000, 13331, 252, 4511, 5884, 44, 25729, 27735, 50257 ]
# Decoded input labels
'<|startoftranscript|><|zh|><|transcribe|><|notimestamps|>還是他塞到我們MongoDB<|endoftext|>'
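
For illustration, here is a minimal sketch (assuming the data preparation from the fine-tuning guide referenced below) of how labels in this format can be produced with the checkpoint's own processor; the language and task arguments are assumptions matching the transcripts above.

```python
from transformers import WhisperProcessor

# Load the processor of the checkpoint being fine-tuned; language/task control
# the <|zh|> and <|transcribe|> prefix tokens.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-large-v3", language="zh", task="transcribe"
)

# Tokenizing the reference transcript prepends the special prefix tokens and
# appends <|endoftext|>, giving labels in the format shown above.
labels = processor.tokenizer("還是他塞到我們MongoDB").input_ids
print(processor.tokenizer.decode(labels))
# <|startoftranscript|><|zh|><|transcribe|><|notimestamps|>還是他塞到我們MongoDB<|endoftext|>
```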

Library Versions

numba
numpy>=1.23.1
soundfile>=0.12.1
librosa>=0.10.0
dataclasses>=0.6
transformers>=4.35.0
bitsandbytes>=0.41.0
datasets>=2.11.0
evaluate>=0.4.0
ctranslate2>=3.21.0
faster-whisper>=0.10.0
jiwer>=2.5.1
peft>=0.6.2
accelerate>=0.21.0
zhconv>=1.4.2
tqdm>=4.62.1
soundcard>=0.4.2
uvicorn>=0.21.1
fastapi>=0.95.1
starlette>=0.26.1
tensorboardX>=2.2
tiktoken==0.3.3
openai-whisper>=20231117
notebook==6.5.4
jupyterlab==4.0.2
pydub>=0.25.1
openpyxl>=3.1.2
setuptools-rust
more-itertools
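
As a small, hedged helper (not part of the original post), the installed versions of the packages most relevant to this setup can be dumped for a bug report like this:

```python
# Print the installed versions of the key packages; purely diagnostic.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["transformers", "peft", "datasets", "accelerate",
            "bitsandbytes", "ctranslate2", "faster-whisper"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} not installed")
```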

Reference Process

https://huggingface.co/blog/fine-tune-whisper

This fine-tuning process works for whisper-large-v2, but not for large-v3. I don't know what's wrong.
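
For context, a minimal sketch of the training setup from that guide with openai/whisper-large-v3 swapped in; the dataset, data collator, and output path are placeholders prepared as the blog describes, and the hyperparameters mirror the guide's, with a smaller per-device batch size as an assumption for the larger model.

```python
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-large-v3", language="zh", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# As in the guide: do not force decoder prompt tokens from the config during generation.
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-zh",  # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    fp16=True,
    gradient_checkpointing=True,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=train_dataset,   # placeholder: prepared as in the blog
    eval_dataset=eval_dataset,     # placeholder: prepared as in the blog
    data_collator=data_collator,   # placeholder: DataCollatorSpeechSeq2SeqWithPadding from the blog
    tokenizer=processor.feature_extractor,
)
trainer.train()
```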


PEFT is also reported to be unstable.
https://github.com/huggingface/peft/issues/1223

Any follow-up? Do you still face this issue? Did you solve it? (And if so, how?)

This issue still bothers me, so I have to keep using whisper-large-v2 for fine-tuning.

@bardenthenry how did you manage GPU memory to fine-tune large-v3? I am using three GPUs with 22 GB of memory each, and even with the batch size set to 1 I still hit a CUDA out-of-memory error.


@lanejohn Maybe reduce the number of LoRA layers?
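
One hedged way to read that suggestion (and the memory question above): load the base model in 8-bit and keep the LoRA adapter small, restricted to a couple of attention projections. The module names and values below are illustrative assumptions, not settings verified in this thread.

```python
from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 8-bit to reduce its memory footprint.
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Keep the adapter small: low rank, only the attention q/v projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```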
