return_timestamps error

#28
by pearlyu - opened

When using the pipeline to get transcriptions with timestamps, it works for some audio files, but for others it returns the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-8cc132230b9b> in <module>
----> 1 prediction = pipe(dataset[0], return_timestamps=True)["chunks"]

4 frames
/usr/local/lib/python3.8/dist-packages/transformers/pipelines/automatic_speech_recognition.py in _find_timestamp_sequence(sequences, tokenizer, feature_extractor, max_source_positions)
    104         sequence = sequence.squeeze(0)
    105         # get rid of the `forced_decoder_idx` that are use to parametrize the generation
--> 106         begin_idx = np.where(sequence == timestamp_begin)[0].item() if timestamp_begin in sequence else 0
    107         sequence = sequence[begin_idx:]
    108 

ValueError: can only convert an array of size 1 to a Python scalar

Below is the code I use to run the pipeline.

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-tiny",
  chunk_length_s=30,
  device=device,
)

filename = files[71][0]
mypath = '/content/drive/MyDrive/twitch_data/audios/prediction/'
audio, _ = librosa.load(mypath+ filename, sr = 16000)

my_dict = {"raw": np.array(audio), 'sampling_rate': np.array(16000)}
prediction = pipe(my_dict, return_timestamps=True)["chunks"]

I'm not sure if this is a bug, or if there's something wrong with the files. Any help is appreciated!

Hey @pearlyu! Thanks for flagging this, and sorry for getting back to you so late. Are you able to reproduce this bug using an audio file we have access to on our end? You can either share the audio file that gives you the error, or try using an audio sample from an HF dataset:

from datasets import load_dataset

# load a small dummy LibriSpeech split and run the same pipeline on one sample
librispeech = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

sample = librispeech[0]["audio"]

prediction = pipe(sample, return_timestamps=True)["chunks"]

We'd need an audio file that breaks the pipeline in order to investigate what's going on!

Hi @sanchit-gandhi, the piece of code that you shared throws the following error:
ValueError: We cannot return_timestamps yet on non-ctc models !

Could you update transformers to the latest version please?

pip install --upgrade transformers
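
In case it helps, here is a minimal way to confirm which version is installed after upgrading (this only uses the standard transformers version attribute; nothing specific to this issue is assumed):

import transformers

# print the installed transformers version to check the upgrade took effect
print(transformers.__version__)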
