About filler word detection (a.k.a. filled pauses / disfluencies, e.g. eh, umm, ahh)

#30 opened by rmajasol

Hi, I've tested the model by pronouncing some of these filler sounds (eh, umm, ahh, among others), but none of them showed up in the transcript. Is there a parameter to adjust this, or is it necessary to fine-tune the model on new datasets containing those sounds?

Thank you!

I agree. Some filler words, stammered words, and repetitions of the same word weren't detected in the text either.
I think this happens because the model itself tries to produce a cleaner, more coherent transcript.

Whisper paper, "Robust Speech Recognition via Large-Scale Weak Supervision", page 21, Appendix C, Text Standardization:

"...We perform the following steps to normalize English texts in different styles into a standardized form, which is a best-effort attempt to penalize only when a word error is caused by actually mistranscribing a word, and not by formatting or punctuation differences.

[Two entries omitted]

  1. Remove any of the following words: hmm, mm, mhm, mmm, uh, um"

In other words, those are effectively filtered out.
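
For reference, this normalizer ships with the openai-whisper package, so you can see the filler removal directly. A minimal sketch (assumes openai-whisper is installed; the sample sentence and expected output are illustrative):

from whisper.normalizers import EnglishTextNormalizer

# the standardization step from Appendix C, including the filler-word removal
normalizer = EnglishTextNormalizer()

print(normalizer("Hmm, uh, I mean, um, it's a fine day."))
# expected: something like "i mean it is a fine day"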

Agreed, but is there any way to add them back in? A flag for this would be useful when I'm using the disfluency timing to edit the audio or associated video.

Yes, it would be very useful for cases such as giving feedback on the different aspects of spoken audio that Whisper can detect. Especially for educational purposes, such as training speaking skills and getting feedback on which filler words were pronounced.

If you're using the model + processor, you can set normalize=False in the processor to skip the entire text normalisation step:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

# load dummy dataset and read audio files
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 

# generate token ids
predicted_ids = model.generate(input_features)
# decode token ids to text with normalisation
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)
print(transcription)
# decode token ids to text without normalisation
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=False)
print(transcription)

Output:

['mister quilter is the apostle of the middle classes and we are glad to welcome his gospel']
[' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.']

Here's a little trick: I prompted Whisper with "So uhm, yeaah. Okay, ehm, uuuh."

That caused it to transcribe these filler words, at least occasionally. Just for reference, I am using this Whisper implementation: https://github.com/guillaumekln/faster-whisper with the "tiny.en" model.
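
In case it helps anyone reproduce this, here's roughly how the prompt is passed in faster-whisper. A sketch (audio.wav is a placeholder path):

from faster_whisper import WhisperModel

model = WhisperModel("tiny.en")

# seeding the decoder with disfluent text nudges it to keep filler words
segments, info = model.transcribe(
    "audio.wav",  # placeholder path
    initial_prompt="So uhm, yeaah. Okay, ehm, uuuh.",
)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")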

Check out this model, which was specifically designed with filler detection in mind:
https://huggingface.co/nyrahealth/CrisperWhisper

and the accompanying repo and paper:
https://github.com/nyrahealth/CrisperWhisper
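
If I'm reading the repo right, it loads like any other Whisper checkpoint through the transformers pipeline. An untested sketch (audio.wav is a placeholder path):

from transformers import pipeline

# CrisperWhisper is tuned for verbatim transcription, fillers included
asr = pipeline("automatic-speech-recognition", model="nyrahealth/CrisperWhisper")

# word-level timestamps are one of the features the repo highlights
result = asr("audio.wav", return_timestamps="word")
print(result["text"])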

You can now detect them using the prompt parameter, e.g. initial_prompt="Umm, let me think like, hmm... Okay, here's what I'm, like, thinking.".
You can see it in the prompting examples here: https://platform.openai.com/docs/guides/speech-to-text/prompting
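
With the hosted OpenAI API that looks roughly like this; note that the API names the parameter prompt, while initial_prompt is the name used by the open-source whisper and faster-whisper libraries. A sketch (audio.wav is a placeholder path):

from openai import OpenAI

client = OpenAI()

# the hosted API's parameter is "prompt", not "initial_prompt"
with open("audio.wav", "rb") as audio_file:  # placeholder path
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="Umm, let me think like, hmm... Okay, here's what I'm, like, thinking.",
    )
print(transcript.text)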
