Model fine-tuner?
@sanchit-gandhi Hello! Are you actively working on this? I'm following along eagerly, as I desperately need to fine-tune the large model to always indicate when the speaker turns. (You can get this somewhat working with a modified prompt, but it readily fails.)
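For reference, the prompt workaround looks roughly like this - a sketch that assumes the `prompt_ids` API available in recent `transformers` releases; the checkpoint and prompt text are purely illustrative:

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Checkpoint and prompt text are illustrative
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# 30 s of silence stands in for real 16 kHz audio
audio = torch.zeros(16000 * 30).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Condition decoding on a prompt written with explicit turn markers;
# this only nudges the output style and is not a reliable diarizer
prompt_ids = processor.get_prompt_ids(
    "- Hello, how are you? - Fine, thanks.", return_tensors="pt"
)
predicted_ids = model.generate(inputs.input_features, prompt_ids=prompt_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```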
Hey! I've opened an active PR on Transformers for ASR fine-tuning: https://github.com/huggingface/transformers/pull/19519
Expect a working script and blog post on the topic next week 🤗
@mezaros Do I understand correctly that you want Whisper to perform speaker diarization? And have you managed to make it work?
@sanchit-gandhi Brilliant, thanks a lot! Just out of interest, is it somehow possible (and is it even a good idea?) to modify the final layers of Whisper to perform classification, for example, instead of transcription? Essentially, the internal representation produced by Whisper would feed into final classification layers, and the whole thing would be trainable / fine-tunable?
Hey @daniel-v-e! Sorry for the late reply here. It's for sure possible to modify Whisper for audio classification tasks! You can add a sequence classification layer / head on top of the base model to generate a single class prediction. Refer to `MBartForSequenceClassification` to see how we achieve this for the MBART model; the same principle applies to the Whisper model. IMO this approach should work - it'll just require fine-tuning with correctly formatted data for audio classification.
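For illustration, here's a minimal sketch of what such a head could look like, mirroring the `MBartForSequenceClassification` pattern but mean-pooling over the encoder states (the class name and pooling strategy are assumptions, not an existing `transformers` API):

```python
import torch
import torch.nn as nn
from transformers import WhisperModel

class WhisperForSequenceClassification(nn.Module):
    """Classification head on the Whisper encoder, mirroring the
    MBartForSequenceClassification pattern (a sketch, not part of
    transformers)."""

    def __init__(self, checkpoint="openai/whisper-small", num_labels=2):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(checkpoint).encoder
        self.classifier = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_features):
        # input_features: log-Mel spectrogram of shape (batch, 80, 3000)
        hidden_states = self.encoder(input_features).last_hidden_state
        # Mean-pool over time (MBART pools on the EOS token instead),
        # then project to class logits
        return self.classifier(hidden_states.mean(dim=1))

# Dummy forward pass: two examples, four classes
model = WhisperForSequenceClassification(num_labels=4)
logits = model(torch.randn(2, 80, 3000))  # -> shape (2, 4)
```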
Sounds good, thank you @sanchit-gandhi! Is there perhaps a similar example somewhere for multi-class classification? Or is the extension to multiple classes straightforward?
Hey @daniel-v-e! You simply need to pass `num_labels=...` to `.from_pretrained`; the modelling code will take care of the rest (cf. https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartForSequenceClassification.forward.example-2 and https://github.com/huggingface/transformers/blob/b210c83a78022226ce48402cd67d8c8da7afbd8d/src/transformers/models/mbart/modeling_mbart.py#L1499).
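For example, a five-class setup would just be (checkpoint name illustrative):

```python
from transformers import MBartForSequenceClassification

# num_labels sizes the classification head; the rest of the
# modelling code adapts automatically
model = MBartForSequenceClassification.from_pretrained(
    "facebook/mbart-large-cc25", num_labels=5
)
```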