Speech recognition broken down by speakers

#167
by tur0kmag - opened

Is it possible to perform speech recognition broken down by speakers?


Hi, do you mean speaker diarization?

I mean, as a result of recognition, I need to get the following:
Speaker 1: Hello. How are you?
Speaker 2: Everything is fine!

I think you should look into this model: https://huggingface.co/pyannote/speaker-diarization.
I don't think the Whisper models alone can be used for this.
Also check out this blog: https://medium.com/@xriteshsharmax/speaker-diarization-using-whisper-asr-and-pyannote-f0141c85d59a
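For reference, running that pyannote pipeline looks roughly like this (a sketch, assuming pyannote.audio is installed and you have accepted the model's terms and have a Hugging Face access token; `meeting.wav` and the token are placeholders, and `format_turn` is just an illustrative helper):

```python
def format_turn(start: float, end: float, speaker: str) -> str:
    """Render one diarization turn as a transcript-style line."""
    return f"[{start:6.1f}s - {end:6.1f}s] {speaker}"

if __name__ == "__main__":
    # Requires `pip install pyannote.audio` and a HF token with access
    # to the gated pyannote/speaker-diarization model.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization",
        use_auth_token="YOUR_HF_TOKEN",  # placeholder
    )
    diarization = pipeline("meeting.wav")  # placeholder audio path
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(format_turn(turn.start, turn.end, speaker))
```

This gives you *who spoke when*; you still need an ASR pass (e.g. Whisper) for the words themselves.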

Thanks for bringing this up! Speaker diarization is actually core to how we handle meeting transcription for teams.

@tur0kmag - The approach mentioned by @andromeda01111 using pyannote with Whisper is solid. I've found that combining them gives you the best of both worlds: Whisper's exceptional transcription accuracy with pyannote's speaker separation capabilities.
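To get the "Speaker 1: Hello. How are you?" output the OP asked for, one common way to combine the two is to transcribe with Whisper (which returns timestamped segments) and then assign each segment the diarization speaker with the largest temporal overlap. Here's a minimal sketch of just that alignment step; the segment/turn shapes below are assumptions modeled on what the two libraries typically return:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length (seconds) of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Label each Whisper segment with the best-overlapping speaker.

    segments: [{"start": s, "end": e, "text": t}, ...]  (Whisper-style)
    turns:    [(start, end, "SPEAKER_00"), ...]         (pyannote-style)
    """
    labeled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg["start"], seg["end"], t[0], t[1]),
            default=(0.0, 0.0, "UNKNOWN"),
        )
        labeled.append((best[2], seg["text"].strip()))
    return labeled

# Toy data standing in for real Whisper/pyannote output:
segments = [
    {"start": 0.0, "end": 2.0, "text": " Hello. How are you?"},
    {"start": 2.1, "end": 3.5, "text": " Everything is fine!"},
]
turns = [(0.0, 2.05, "SPEAKER_00"), (2.05, 4.0, "SPEAKER_01")]
for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
# SPEAKER_00: Hello. How are you?
# SPEAKER_01: Everything is fine!
```

Largest-overlap assignment is a simplification: a Whisper segment that spans a speaker change gets attributed entirely to one speaker, which is one reason overlapping speech (point 3 below) stays hard.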

A few practical tips from my experience implementing this in organizational settings:

  1. Pre-processing audio makes a huge difference - normalizing volume levels between speakers can significantly improve diarization accuracy when some participants are quieter than others

  2. For meetings with known participants, you can improve accuracy by creating speaker "embeddings" (voice prints) beforehand if your use case allows for it

  3. Be mindful that overlapping speech remains challenging - in meetings where people frequently talk over each other, you might want to add a visual cue in transcripts where overlap is detected

  4. For recurring meetings with the same participants, you can potentially improve speaker identification by leveraging previous meetings' data
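On point 1, a simple form of level normalization is scaling samples toward a common target RMS before diarization, so quiet and loud speakers land in a similar range. A toy sketch on raw float samples (a real pipeline would more likely use something like ffmpeg's `loudnorm` filter or a loudness library; the target value here is arbitrary):

```python
import math

def rms(samples):
    """Root-mean-square level of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def normalize_rms(samples, target_rms=0.1):
    """Scale samples so their RMS matches target_rms (no-op on silence)."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    gain = target_rms / level
    # Clip to [-1, 1] so a large gain cannot push samples out of range.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]

quiet = [0.01, -0.01, 0.02, -0.02]   # a quiet speaker
loud = [0.5, -0.5, 0.4, -0.4]        # a loud speaker
print(round(rms(normalize_rms(quiet)), 3))  # → 0.1
print(round(rms(normalize_rms(loud)), 3))   # → 0.1
```

Applied per channel or per chunk, this kind of leveling is cheap and tends to help diarization more than transcription, since speaker embeddings are sensitive to energy differences.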

Has anyone here tried this in production with real meeting recordings? I'm particularly interested in how others handle the trade-off between real-time processing and accuracy.

I briefly tried it by following the code mentioned above. When I tried it with podcast audio, the results were promising. But when I tried it with call recordings without any pre-processing, the results were really bad. I'm now working on fine-tuning Whisper large-v3 on 3 languages, and I plan to get back to this after I'm done.

@ClarityAI Thank you for the tips. I'll try all of them when I get back to this project.
