Model failing to transcribe but working fine for translation
I'm working with audio sampled at 8000 Hz, and I've observed strange behavior with the Whisper model when comparing translation and transcription. I'm using the Hugging Face pipeline with the Whisper large v2 model. It works well for translation, but for transcription it repeats the same word throughout the output. I've tried resampling the audio to 16000 Hz and normalizing it, but I still get the same results.
Do you have a reproducible code snippet for this, @atulyaatul? Would be happy to take a look! Otherwise, an easy thing to try is decoding with timestamps (pass return_timestamps=True), which often reduces hallucinations. If inference speed is less of a consideration, you can also activate beam search by passing generate_kwargs={"num_beams": 2} to the pipeline.
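For reference, here is a minimal sketch of how the two suggestions above could be combined in one call. It assumes the checkpoint is openai/whisper-large-v2 (the thread only says "Whisper large v2") and that the options are passed when calling the pipeline; the helper names build_asr_call_kwargs and transcribe are made up for illustration and are not part of transformers.

```python
def build_asr_call_kwargs(use_timestamps=True, num_beams=2):
    """Collect the per-call options suggested in the reply above.

    Hypothetical helper: only return_timestamps and
    generate_kwargs={"num_beams": ...} come from the thread.
    """
    call_kwargs = {}
    if use_timestamps:
        # Decoding with timestamps often reduces repeated/hallucinated output.
        call_kwargs["return_timestamps"] = True
    if num_beams > 1:
        # Beam search is slower than greedy decoding but can be more stable.
        call_kwargs["generate_kwargs"] = {"num_beams": num_beams}
    return call_kwargs


def transcribe(audio_path, model_id="openai/whisper-large-v2", **options):
    """Run the ASR pipeline on one file (model_id is an assumed checkpoint)."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path, **build_asr_call_kwargs(**options))
```

Calling transcribe("sample.wav") (with a placeholder file name) should return a dict whose "text" entry holds the transcript; with timestamps enabled, a "chunks" entry with per-segment timestamps is included as well. If the repetition persists with both options on, a short reproducible snippet and sample would still be the best way to debug further.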