model failing to transcribe but working fine for translation

#70

by atulyaatul - opened Sep 28, 2023

Sep 28, 2023

I'm working with 8000 Hz audio, and I've observed strange behavior from the Whisper model that differs between translation and transcription. I'm using the Hugging Face pipeline with the Whisper large-v2 model. Translation works well, but transcription repeats the same word throughout the output. I've tried converting the audio to 16000 Hz and normalizing it, but I still get the same results.

sanchit-gandhi

Sep 28, 2023

Do you have a reproducible code snippet for this, @atulyaatul? I'd be happy to take a look! Otherwise, an easy thing to try is decoding with timestamps (pass return_timestamps=True), which often reduces hallucinations. If inference speed is less of a concern, you can also enable beam search by passing generate_kwargs={"num_beams": 2} to the pipeline.
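For reference, the two suggestions above could be combined roughly as follows. This is only a sketch under stated assumptions, not the poster's actual setup: the helper names build_call_kwargs and transcribe are hypothetical, the checkpoint id openai/whisper-large-v2 is assumed from the model named in the question, and forcing the task through generate_kwargs={"task": "transcribe"} is an extra assumption that was not suggested in the thread.

```python
from typing import Any, Dict, Optional


def build_call_kwargs(return_timestamps: bool = True,
                      num_beams: Optional[int] = None,
                      task: str = "transcribe") -> Dict[str, Any]:
    """Collect the per-call arguments passed to the ASR pipeline.

    Hypothetical helper: it only assembles a dict and loads no model.
    """
    # Assumption: the task is set explicitly so Whisper transcribes
    # rather than translates.
    generate_kwargs: Dict[str, Any] = {"task": task}
    if num_beams is not None:
        # Beam search trades inference speed for more stable decoding.
        generate_kwargs["num_beams"] = num_beams
    return {
        # Decoding with timestamps often reduces repeated-word hallucinations.
        "return_timestamps": return_timestamps,
        "generate_kwargs": generate_kwargs,
    }


def transcribe(audio_path: str, **options: Any) -> Dict[str, Any]:
    """Run Whisper large-v2 on one audio file (downloads the model on first use)."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v2",  # assumed checkpoint id
    )
    return asr(audio_path, **build_call_kwargs(**options))
```

Calling transcribe("sample.wav", num_beams=2)["text"] would then return the transcription, where sample.wav is a placeholder path. If the repeated-word output persists with these settings, a reproducible snippet and a short audio sample remain the most useful next step.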
