return_timestamps error

#28
by pearlyu - opened

When using the pipeline to get transcriptions with timestamps, it works for some audio files, but for others it returns the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-8cc132230b9b> in <module>
----> 1 prediction = pipe(dataset[0], return_timestamps=True)["chunks"]

4 frames
/usr/local/lib/python3.8/dist-packages/transformers/pipelines/automatic_speech_recognition.py in _find_timestamp_sequence(sequences, tokenizer, feature_extractor, max_source_positions)
    104         sequence = sequence.squeeze(0)
    105         # get rid of the `forced_decoder_idx` that are use to parametrize the generation
--> 106         begin_idx = np.where(sequence == timestamp_begin)[0].item() if timestamp_begin in sequence else 0
    107         sequence = sequence[begin_idx:]
    108 

ValueError: can only convert an array of size 1 to a Python scalar

Below is the code I use to run the pipeline.

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-tiny",
  chunk_length_s=30,
  device=device,
)

filename = files[71][0]
mypath = '/content/drive/MyDrive/twitch_data/audios/prediction/'
audio, _ = librosa.load(mypath+ filename, sr = 16000)

my_dict = {"raw": np.array(audio), 'sampling_rate': np.array(16000)}
prediction = pipe(my_dict, return_timestamps=True)["chunks"]

I'm not sure if this is a bug, or if there's something wrong with the files. Any help is appreciated!

Hey @pearlyu! Thanks for flagging this, and sorry for getting back to you so late. Are you able to reproduce this bug using an audio file we have access to on our end? You can either share the audio file that gives you the error, or try using an audio sample from an HF dataset:

from datasets import load_dataset

# load a small dummy LibriSpeech split and run the same pipeline on one sample
librispeech = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

sample = librispeech[0]["audio"]

prediction = pipe(sample, return_timestamps=True)["chunks"]

We'd need an audio file that breaks the pipeline in order to investigate what's going on!

Hi @sanchit-gandhi, the piece of code that you shared throws the following error:
ValueError: We cannot return_timestamps yet on non-ctc models !

Could you update transformers to the latest version please?

pip install --upgrade transformers
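
In case it helps, here is a minimal way to confirm which version is installed after upgrading (this only uses the standard transformers version attribute; nothing specific to this issue is assumed):

import transformers

# print the installed transformers version to check the upgrade took effect
print(transformers.__version__)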
