Phoneme recognition

#86
by dg96 - opened

Is it possible to use Whisper to output a phoneme transcription instead of a text transcription?

Hi @sanchit-gandhi
Thank you for pointing me towards the discussions page.

If I understand it correctly, Whisper currently cannot output phoneme transcriptions. However, there was one response saying that one could train a Whisper model with audio + phoneme transcriptions instead of the recommended audio + text transcriptions. Is this possible? For fine-tuning Whisper with audio + phoneme transcriptions, I would be using the pretrained feature extractor and tokenizer as per your blog https://huggingface.co/blog/fine-tune-whisper.
Please let me know your thoughts on this.

Thanks!

Hey @dg96 - that's a cool proposition! I think we could fine-tune Whisper for phoneme transcriptions. The feature extractor can stay the same (we can pre-process the audio in the same way as before). We'd need to change the tokenizer to handle the new vocabulary. Namely, what we need to do is build a new tokenizer over the possible phonemes. For this, you can follow this guide: https://huggingface.co/learn/nlp-course/chapter6/8?fw=pt
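
To give a rough idea of what that could look like for phonemes (a minimal sketch, not taken from the guide: the phoneme corpus, special tokens, and save directory below are all illustrative assumptions), one could train a word-level tokenizer over space-separated phoneme strings with the tokenizers library:

# Sketch: build a word-level tokenizer over space-separated phoneme strings
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

phoneme_corpus = ["HH AH L OW", "W ER L D"]  # toy example; use your real phoneme transcriptions

tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordLevelTrainer(special_tokens=["[UNK]", "[PAD]", "<|startoftranscript|>", "<|endoftext|>"])
tokenizer.train_from_iterator(phoneme_corpus, trainer=trainer)

# wrap it for use with HF Transformers and save it to disk
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="[UNK]", pad_token="[PAD]")
hf_tokenizer.save_pretrained("./whisper-phoneme-tokenizer")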

You should then have a tokenizer that you can load with HF Transformers:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(...)

Once we have built our new tokenizer, we need to make sure that the Whisper embedding layer has the same size as the new tokenizer's vocabulary:

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")  # or any pretrained Whisper checkpoint

# new random embeddings for our phoneme tokens
model.resize_token_embeddings(len(tokenizer))

Once we've done that, the Whisper model will be set up to predict phonemes instead of sub-word tokens. You can then fine-tune the model on an (audio, phoneme) dataset in exactly the same way as the fine-tuning blog describes. You might want to change the compute_metrics function to a metric more applicable to phoneme prediction than WER, such as the phoneme error rate (PER).
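
As a rough sketch of what that could look like (not from the blog: the per_metric name and the idea of re-using the WER metric over space-separated phonemes as a stand-in for PER are assumptions), the blog's compute_metrics could be adapted along these lines:

# Sketch: treat space-separated phonemes as "words", so the WER metric
# effectively computes a phoneme error rate (PER) over the decoded sequences
import evaluate

per_metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 (positions ignored by the loss) with the pad token id so they can be decoded
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    per = 100 * per_metric.compute(predictions=pred_str, references=label_str)
    return {"per": per}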

I am not an expert on Whisper, but a related use case needs timing data as well. For example, to control a 3D animated character's facial expressions, you need phonemes plus timing data for each phoneme; otherwise the lipsync can drift out of alignment.

Quote: "Once we've done that, the Whisper model will now be set to predict phonemes instead of sub-word tokens." But, it still CAN, and will still compute sub-word token logits in output layer and will still TRY or be tempted to predict sub-word tokens (unless with prompt engineering you can deactivate sub-word prediction) Adding phoneme tokens to the vocabulary in addition to the original sub-word tokens, via "finetuning" etc., increases the computational load at the output layer, and dilutes the reliability of the softmax output. So, in this application, after adding phoneme tokens to the model vocabulary, set a "phoneme-output mode" switch flag to disable the logit computation of the original sub-word tokens (except for such of the subwords as are also desired phonemes), using my invention : Method_for_Dynamically_Reducing_Logit_Computation_in_LLMs https://huggingface.co/MartialTerran/Method_for_Dynamically_Reducing_Logit_Computation_in_LLMs

With this further enhancement (reducing the available logit set during inference), you minimize the computational load at the output layer and improve the reliability of the phoneme token logits that are computed and softmaxed to identify the token. You are licensed to experiment with my invention Method_for_Dynamically_Reducing_Logit_Computation_in_LLMs for 30 days, starting when you send me a Notice of your Intention to experiment with it.
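
For comparison, a simpler option that already exists in Transformers, if you only want to stop the model from emitting the original sub-word tokens at inference time (it still computes their logits, so it is not the dynamic logit-reduction method linked above), is to mask everything outside a phoneme whitelist with a custom logits processor. A rough sketch; allowed_ids and the class name are illustrative:

# Sketch: mask all logits except a whitelist of phoneme (and required special) token ids
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class PhonemeOnlyLogitsProcessor(LogitsProcessor):
    def __init__(self, allowed_token_ids):
        self.allowed = torch.tensor(allowed_token_ids)

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed] = 0.0  # leave whitelisted logits unchanged
        return scores + mask

# allowed_ids = [...ids of the phoneme tokens plus any required special tokens...]
# generated = model.generate(input_features, logits_processor=LogitsProcessorList([PhonemeOnlyLogitsProcessor(allowed_ids)]))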
