Sanchit Gandhi

sanchit-gandhi

AI & ML interests

Open-Source Speech

Why does returning timestamps help Whisper reduce hallucinations? 🧐

Empirically, most practitioners have found that setting return_timestamps=True helps reduce hallucinations, particularly when doing long-form evaluation with Transformers’ “chunked” algorithm.
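For context, the chunked algorithm splits long audio into fixed-length windows with overlapping strides, transcribes each window, and merges the results. Here is a minimal sketch of just the windowing step; the 30 s chunk and 5 s stride are illustrative values (only roughly matching Transformers' chunk_length_s/stride defaults), and chunk_windows is a hypothetical helper, not a Transformers API:

```python
# Toy sketch of "chunked" long-form inference windowing: fixed-length
# windows with overlap on each side, so chunk boundaries can be merged.
# The 30 s chunk and 5 s stride are illustrative assumptions.

def chunk_windows(n_samples: int, sr: int = 16_000,
                  chunk_s: float = 30.0, stride_s: float = 5.0):
    """Yield (start, end) sample indices for overlapping chunks."""
    chunk = int(chunk_s * sr)
    # Each step leaves stride_s seconds of overlap on both sides.
    step = chunk - 2 * int(stride_s * sr)
    windows = []
    start = 0
    while start < n_samples:
        windows.append((start, min(start + chunk, n_samples)))
        if start + chunk >= n_samples:
            break
        start += step
    return windows

# 90 s of 16 kHz audio -> four overlapping 30 s windows
windows = chunk_windows(90 * 16_000)
```

Each window is then transcribed independently, which is exactly where hallucinations at chunk boundaries can creep in.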

But why does this work?

My interpretation is that forcing the model to predict timestamps is at odds with hallucinating. Suppose you have the transcription:
The cat sat on the on the on the mat.

Here the model has hallucinated two extra copies of “on the”. If we ask the model to predict timestamps, then the repeated “on the” has to contribute to the overall segment-level timing, e.g.:
<|0.00|> The cat sat on the on the on the mat.<|5.02|>

However, it’s impossible to fit three copies of “on the” within the time allocated to the segment, so the probability of this hallucinatory sequence becomes lower, and the model actually predicts the correct transcription with the highest probability:
<|0.00|> The cat sat on the mat.<|5.02|>
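The intuition above can be made concrete with a toy sketch: if each word consumes some minimum speaking time, a candidate transcription whose words cannot fit between the predicted start and end timestamps gets its score pushed down. The per-word rate and the penalty rule here are illustrative assumptions, not Whisper's actual decoder behaviour:

```python
# Toy illustration: a sequence whose words cannot fit inside the predicted
# <|0.00|> ... <|5.02|> segment gets its (log-)probability penalised.
# The 0.8 s-per-word rate and the penalty rule are illustrative assumptions.

SECONDS_PER_WORD = 0.8  # assumed minimum speaking time per word

def timestamp_penalty(words: list, start: float, end: float) -> float:
    """Return a log-prob penalty: 0.0 if the words fit the segment,
    otherwise proportional to the excess speaking time required."""
    required = len(words) * SECONDS_PER_WORD
    budget = end - start
    return max(0.0, required - budget) * -10.0  # steep penalty per excess second

hallucinated = "The cat sat on the on the on the mat".split()
correct = "The cat sat on the mat".split()

# Both candidates are scored against the same 5.02 s segment.
p_hallucinated = timestamp_penalty(hallucinated, 0.00, 5.02)  # negative
p_correct = timestamp_penalty(correct, 0.00, 5.02)            # 0.0
```

Under this toy scoring, the hallucinated sequence needs 8 s of speech for a 5.02 s segment and is penalised, while the correct transcription fits comfortably and is unpenalised, so it wins.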

In this sense, the end timestamp is the counterpart of the initial timestamp constraint described in Section 4.5 of the paper Robust Speech Recognition via Large-Scale Weak Supervision (2212.04356): it helps the model drop extra words at the end of the sequence (whereas the initial timestamp helps when the model ignores words at the start), but the overall principle is the same — using timestamps to increase the probability of more realistic sequences.

I'll leave it open to you: why do you think timestamps reduce Whisper hallucinations?