Long Audio Files

#7
by eikosa - opened

How can I process long audio recordings with this model? (1-60 min)

You can use the pipeline class as per the demo at: https://huggingface.co/spaces/sanchit-gandhi/whisper-large-v2

This will enable you to transcribe files of arbitrary length:

import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,
    device=device,
)

out = pipe(audio)["text"]

where audio is the path to an audio file or a loaded audio array (see https://github.com/huggingface/transformers/blob/c1b9a11dd4be8af32b3274be7c9774d5a917c56d/src/transformers/pipelines/automatic_speech_recognition.py#L201)
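For the array input, a minimal sketch of what the pipeline accepts (the sine wave here is a stand-in for a real recording, which you would typically load with librosa or soundfile):

```python
import numpy as np

# Stand-in for a real recording: 2 s of a 440 Hz tone, 16 kHz mono.
# In practice you would load this with e.g. librosa.load(path, sr=16000).
sampling_rate = 16000
t = np.linspace(0, 2, 2 * sampling_rate, endpoint=False)
waveform = (0.1 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

# The pipeline also accepts a dict carrying the raw array plus its
# sampling rate, so you can resample/preprocess yourself beforehand.
audio = {"raw": waveform, "sampling_rate": sampling_rate}

# out = pipe(audio)["text"]  # pipe as instantiated above
```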

How can I set the output language with this method?

Hey @eikosa! Just make sure you've installed Transformers from main:

pip install git+https://github.com/huggingface/transformers

Then you can change the language as follows:

import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,
    device=device,
)

# change language as required
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="Spanish", task="transcribe")

out = pipe(audio)["text"]
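On more recent Transformers versions you can also pass the language per call through generate_kwargs rather than mutating the model config; a sketch of the call pattern (the pipe call itself is commented out, since it needs the instantiated pipeline and model weights):

```python
# Passing language/task per call avoids overwriting
# model.config.forced_decoder_ids for every subsequent request.
generate_kwargs = {"language": "spanish", "task": "transcribe"}

# out = pipe(audio, generate_kwargs=generate_kwargs)["text"]
```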

Hi @sanchit-gandhi
If we use the pipeline as shown on https://huggingface.co/openai/whisper-large-v2 instead of the one you specified above, transcription stops at the 30-second mark. Can this be solved?

Hey @kirankumaram , sorry for the late reply! You just need to specify chunk_length_s=30 when you instantiate the pipeline:

import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,  # <- this arg lets us do long-form transcriptions!
    device=device,
)

# load your audio as required
audio = ...

# inference
out = pipe(audio)["text"]
print(out)
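Under the hood, chunk_length_s splits the audio into overlapping windows that are transcribed independently and stitched back together. A rough, self-contained sketch of how such window boundaries could be computed (the stride value here is illustrative; the actual pipeline exposes it as stride_length_s, defaulting to chunk_length_s / 6 on each side):

```python
def chunk_boundaries(num_samples, sampling_rate=16000,
                     chunk_length_s=30.0, stride_length_s=5.0):
    """Yield (start, end) sample indices for overlapping chunks.

    Each window is chunk_length_s long, and consecutive windows overlap
    by stride_length_s on each side so the model sees enough context at
    the seams to stitch the per-chunk transcriptions together.
    """
    chunk_len = int(chunk_length_s * sampling_rate)
    stride = int(stride_length_s * sampling_rate)
    step = chunk_len - 2 * stride  # hop between window starts
    boundaries = []
    for start in range(0, num_samples, step):
        boundaries.append((start, min(start + chunk_len, num_samples)))
        if start + chunk_len >= num_samples:
            break
    return boundaries

# 90 s of 16 kHz audio -> a few overlapping 30 s windows
print(chunk_boundaries(90 * 16000))
```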
