Audio transcription, timestamping for whole sentences.

#16 opened by artyomboyko

Good afternoon. Is there any way to generate an audio transcription without breaking sentences apart?

For example, when transcribing a video, instead of getting:

00:00:08,960 --> 00:00:13,840 This video is an introductory video about coders, decoders and codecs.
00:00:13,840 --> 00:00:18,640 In this episode we try to understand what a transformer network is all about,
00:00:18,640 --> 00:00:24,720 and try to explain it in simple, high-level terms. 

I would like to get the following:

00:00:08,960 --> 00:00:18,640 This video is an introductory video to a series of videos about coders, decoders, and coder decoders.
00:00:18,640 --> 00:00:24,720 In this series we will try to understand what a transformer network is and try to explain it in simple, high-level terms.

???

Hey @ElectricSecretAgent! Could you simply piece together the transcriptions and take the first/last timestamps?

import torch
from transformers import pipeline
from datasets import load_dataset

model = "openai/whisper-tiny"
device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model=model,
    chunk_length_s=30,
    device=device,
)

# replace this with the loading/inference for your audio sample
ls_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
out = pipe(ls_dummy[0]["audio"], return_timestamps=True)

# join all the text together
text = [chunk["text"] for chunk in out["chunks"]]
text = "".join(text)

# get first timestamp of first chunk
start = out["chunks"][0]["timestamp"][0]
# get last timestamp of last chunk
end = out["chunks"][-1]["timestamp"][-1]

print(f"{start} -> {end}: {text}")

Print output:

0.0 -> 5.44:  Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.
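
If you want one timestamp span per whole sentence rather than a single span for the entire audio, you can apply the same idea per sentence: keep appending chunks until one ends in sentence-final punctuation, then take the first and last timestamps of that group. A minimal sketch, reusing out from the snippet above (the merge_into_sentences helper and the punctuation rule are just illustrative assumptions, not part of the pipeline API):

import re

def merge_into_sentences(chunks):
    # group consecutive pipeline chunks into sentences, each with one (start, end) span
    sentences = []
    buffer_text = ""
    buffer_start = None
    for chunk in chunks:
        if buffer_start is None:
            buffer_start = chunk["timestamp"][0]
        buffer_text += chunk["text"]
        # treat '.', '!' or '?' at the end of a chunk as a sentence boundary
        if re.search(r"[.!?]\s*$", chunk["text"]):
            sentences.append((buffer_start, chunk["timestamp"][1], buffer_text.strip()))
            buffer_text = ""
            buffer_start = None
    # keep any trailing text that never reached a sentence boundary
    if buffer_text:
        sentences.append((buffer_start, chunks[-1]["timestamp"][1], buffer_text.strip()))
    return sentences

for sent_start, sent_end, sentence in merge_into_sentences(out["chunks"]):
    print(f"{sent_start} -> {sent_end}: {sentence}")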

Thanks, I will test it.

See ACICFG's implementation (with VAD, forced alignment, and a translation pipeline): https://colab.research.google.com/github/cnbeining/Whisper_Notebook/blob/master/WhisperX.ipynb

You can also set batch_size=... in the transformers implementation to speed up transcription for long audio samples:

out = pipe(ls_dummy[0]["audio"], return_timestamps=True, batch_size=4)
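
If you need the output in the SRT-style format from the original question (e.g. 00:00:08,960 --> 00:00:13,840), a small helper can convert the float seconds returned by the pipeline. A minimal sketch, reusing start, end, and text from the earlier snippet (seconds_to_srt is just an illustrative name, not a transformers function):

def seconds_to_srt(seconds):
    # convert a float number of seconds into HH:MM:SS,mmm
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

print(f"{seconds_to_srt(start)} --> {seconds_to_srt(end)} {text}")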
