Speech recognition broken down by speakers

#167
by tur0kmag - opened

Is it possible to perform speech recognition broken down by speakers?


Hi, do you mean speaker diarization?

I mean, as a result of recognition, I need to get the following:
Speaker 1: Hello. How are you?
Speaker 2: Everything is fine!

I think you should look into this model: https://huggingface.co/pyannote/speaker-diarization.
I don't think the Whisper models alone can be used for this.
Also check out this blog: https://medium.com/@xriteshsharmax/speaker-diarization-using-whisper-asr-and-pyannote-f0141c85d59a
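For reference, running that pyannote pipeline looks roughly like this (a sketch, assuming pyannote.audio is installed and you have accepted the model's terms and have a Hugging Face access token; `meeting.wav` and the token are placeholders, and `format_turn` is just an illustrative helper):

```python
def format_turn(start: float, end: float, speaker: str) -> str:
    """Render one diarization turn as a transcript-style line."""
    return f"[{start:6.1f}s - {end:6.1f}s] {speaker}"

if __name__ == "__main__":
    # Requires `pip install pyannote.audio` and a HF token with access
    # to the gated pyannote/speaker-diarization model.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization",
        use_auth_token="YOUR_HF_TOKEN",  # placeholder
    )
    diarization = pipeline("meeting.wav")  # placeholder audio path
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(format_turn(turn.start, turn.end, speaker))
```

This gives you *who spoke when*; you still need an ASR pass (e.g. Whisper) for the words themselves.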

Thanks for bringing this up! Speaker diarization is actually core to how we handle meeting transcription for teams.

@tur0kmag - The approach mentioned by @andromeda01111 using pyannote with Whisper is solid. I've found that combining them gives you the best of both worlds: Whisper's exceptional transcription accuracy with pyannote's speaker separation capabilities.
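To get the "Speaker 1: Hello. How are you?" output the OP asked for, one common way to combine the two is to transcribe with Whisper (which returns timestamped segments) and then assign each segment the diarization speaker with the largest temporal overlap. Here's a minimal sketch of just that alignment step; the segment/turn shapes below are assumptions modeled on what the two libraries typically return:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length (seconds) of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Label each Whisper segment with the best-overlapping speaker.

    segments: [{"start": s, "end": e, "text": t}, ...]  (Whisper-style)
    turns:    [(start, end, "SPEAKER_00"), ...]         (pyannote-style)
    """
    labeled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg["start"], seg["end"], t[0], t[1]),
            default=(0.0, 0.0, "UNKNOWN"),
        )
        labeled.append((best[2], seg["text"].strip()))
    return labeled

# Toy data standing in for real Whisper/pyannote output:
segments = [
    {"start": 0.0, "end": 2.0, "text": " Hello. How are you?"},
    {"start": 2.1, "end": 3.5, "text": " Everything is fine!"},
]
turns = [(0.0, 2.05, "SPEAKER_00"), (2.05, 4.0, "SPEAKER_01")]
for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
# SPEAKER_00: Hello. How are you?
# SPEAKER_01: Everything is fine!
```

Largest-overlap assignment is a simplification: a Whisper segment that spans a speaker change gets attributed entirely to one speaker, which is one reason overlapping speech (point 3 below) stays hard.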

A few practical tips from my experience implementing this in organizational settings:

  1. Pre-processing audio makes a huge difference - normalizing volume levels between speakers can significantly improve diarization accuracy when some participants are quieter than others

  2. For meetings with known participants, you can improve accuracy by creating speaker "embeddings" (voice prints) beforehand if your use case allows for it

  3. Be mindful that overlapping speech remains challenging - in meetings where people frequently talk over each other, you might want to add a visual cue in transcripts where overlap is detected

  4. For recurring meetings with the same participants, you can potentially improve speaker identification by leveraging previous meetings' data
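On point 1, a simple form of level normalization is scaling samples toward a common target RMS before diarization, so quiet and loud speakers land in a similar range. A toy sketch on raw float samples (a real pipeline would more likely use something like ffmpeg's `loudnorm` filter or a loudness library; the target value here is arbitrary):

```python
import math

def rms(samples):
    """Root-mean-square level of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def normalize_rms(samples, target_rms=0.1):
    """Scale samples so their RMS matches target_rms (no-op on silence)."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    gain = target_rms / level
    # Clip to [-1, 1] so a large gain cannot push samples out of range.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]

quiet = [0.01, -0.01, 0.02, -0.02]   # a quiet speaker
loud = [0.5, -0.5, 0.4, -0.4]        # a loud speaker
print(round(rms(normalize_rms(quiet)), 3))  # → 0.1
print(round(rms(normalize_rms(loud)), 3))   # → 0.1
```

Applied per channel or per chunk, this kind of leveling is cheap and tends to help diarization more than transcription, since speaker embeddings are sensitive to energy differences.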

Has anyone here tried this in production with real meeting recordings? I'm particularly interested in how others handle the trade-off between real-time processing and accuracy.

I briefly tried it by following the code mentioned above. When I tried it with podcast audio, the results were promising. But when I tried it with call recordings without any pre-processing, the results were really bad. I'm now working on fine-tuning Whisper large-v3 on 3 languages, and I plan to get back to this after I'm done.

@ClarityAI Thank you for the tips. I'll try all of them when I get back to this project.
