How to use whisper-small-cv11-french without training it?

#4
by slain62 - opened

Hello,

I would like to use your model to transcribe an audio recording of a meeting to text.
Here is my code snippet (yes, I know it would be cleaner to create functions and call them, but for the moment it is just to test whether the script works):
[code]

# Imports
from pathlib import Path
import streamlit as st
import torch

from tempfile import NamedTemporaryFile
from datasets import load_dataset
from transformers import pipeline

# Initialize environment
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-small-cv11-french", device=device)
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

# Display
st.title("Télécharger un enregistrement de réunion pour obtenir sa transcription en texte")
col1, col2 = st.columns(2)
audio_source = st.sidebar.file_uploader(label="Choisir votre fichier", type=["wav", "m4a", "mp3", "wma"])

# Variables
suffix = ""
predicted_sentence = ""

# Processing
if audio_source is not None:
    col1.toast("Début du traitement")
    # Raw bytes of the uploaded file; the pipeline decodes them itself
    waveform = audio_source.getvalue()
    col1.toast("Lancement du pipe")
    predicted_sentence = pipe(waveform, max_new_tokens=225)
    col2.write("Transcription :")
    # The pipeline returns a dict; the transcription is under the "text" key
    col2.write(predicted_sentence["text"])
    col2.download_button(label="Télécharger la transcription", data=predicted_sentence["text"], file_name="transcript.txt", mime="text/plain")
[/code]

It results in an error:
[code]
ValueError: You have passed more than 3000 mel input features (> 30 seconds) which automatically enables long-form generation which requires the generation config to have no_timestamps_token_id correctly. Make sure to initialize the generation config with the correct attributes that are needed such as no_timestamps_token_id. For more details on how to generate the approtiate config, refer to https://github.com/huggingface/transformers/issues/21878#issuecomment-1451902363or make sure to pass no more than 3000 mel input features.
Traceback:
File "/home/ild/miniconda3/envs/transcript/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 542, in _run_script
    exec(code, module.__dict__)
File "/home/ild/whisperFR.py", line 34, in <module>
    predicted_sentence = pipe(waveform, max_new_tokens=225)
File "/home/ild/.local/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 292, in __call__
    return super().__call__(inputs, **kwargs)
File "/home/ild/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1154, in __call__
    return next(
File "/home/ild/.local/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
File "/home/ild/.local/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
File "/home/ild/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
File "/home/ild/.local/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 507, in _forward
    tokens = self.model.generate(
File "/home/ild/.local/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py", line 502, in generate
    self._set_return_timestamps(
File "/home/ild/.local/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py", line 969, in _set_return_timestamps
    raise ValueError(
[/code]

I don't know whether the problem comes from copy/pasting a script intended for training and fine-tuning the model, from an argument I should pass somewhere, or whether I need to find another model (one small enough to run on a 4 GB GPU).

Thank you

Hi,

Thank you for your interest. It appears that the model's config was not compatible with recent versions of transformers, following the changes discussed in the Hugging Face issue linked in your error message. I've made adjustments to the config. Could you try again?
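
For readers hitting the same error: the "3000 mel input features" limit corresponds to 30 seconds of audio, since Whisper works on 16 kHz audio with one mel frame every 160 samples (100 frames per second). The sketch below only illustrates that arithmetic and one possible workaround, namely passing `chunk_length_s=30` to the pipeline call so that each chunk stays within the limit. It is not the config change made in this thread, which is not shown here. The helper names (`mel_frames`, `needs_long_form`, `call_kwargs`) are hypothetical, and `num_samples` is assumed to be the length of the decoded mono 16 kHz waveform.

```python
# Whisper front-end constants: 16 kHz audio, one mel frame every 160 samples.
SAMPLING_RATE = 16_000
HOP_LENGTH = 160
MAX_SHORT_FORM_FRAMES = 3000  # the limit quoted in the error message (30 s)


def mel_frames(num_samples: int) -> int:
    """Approximate number of mel frames for a decoded 16 kHz mono waveform."""
    return num_samples // HOP_LENGTH


def needs_long_form(num_samples: int) -> bool:
    """True when the audio exceeds the 30 s short-form limit."""
    return mel_frames(num_samples) > MAX_SHORT_FORM_FRAMES


def call_kwargs(num_samples: int) -> dict:
    """Hypothetical helper: keyword arguments for the pipeline call."""
    kwargs = {"max_new_tokens": 225}
    if needs_long_form(num_samples):
        # Assumption: letting the pipeline split the audio into 30 s chunks
        # keeps every chunk at or below 3000 mel frames.
        kwargs["chunk_length_s"] = 30
    return kwargs


one_minute = 60 * SAMPLING_RATE
print(mel_frames(one_minute), needs_long_form(one_minute), call_kwargs(one_minute))

# Hypothetical usage with the pipeline from the question:
# text = pipe(waveform, **call_kwargs(num_samples))["text"]
```

Chunking trades a little accuracy at chunk boundaries for the ability to process recordings of arbitrary length, which matters for meeting recordings.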

Hello,
Just tested it, now transcription works.
Do you think I can train it with this 4 GB GPU, or is this a very bad idea?

I'd say training with 4 GB could be quite challenging.
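
A rough back-of-the-envelope estimate shows why. The sketch below assumes full fine-tuning in 32-bit precision with an Adam-style optimizer (4 bytes per parameter for weights, 4 for gradients, 8 for the two optimizer moments) and about 244 million parameters for a Whisper small model. It ignores activations, the CUDA context, and batch data, so the real requirement is higher. The function name is hypothetical.

```python
WHISPER_SMALL_PARAMS = 244_000_000  # approximate parameter count


def training_state_gib(num_params: int,
                       bytes_weights: int = 4,
                       bytes_grads: int = 4,
                       bytes_optimizer: int = 8) -> float:
    """Memory for weights, gradients and Adam moments, in GiB.

    Assumes fp32 everywhere; activations and framework overhead are excluded.
    """
    total_bytes = num_params * (bytes_weights + bytes_grads + bytes_optimizer)
    return total_bytes / 1024**3


estimate = training_state_gib(WHISPER_SMALL_PARAMS)
print(f"Weights + gradients + optimizer state: about {estimate:.1f} GiB")
```

Under these assumptions the model state alone nearly fills a 4 GB card before any audio batch is loaded. Memory-saving techniques (mixed precision, 8-bit optimizers, parameter-efficient fine-tuning) change the numbers, but they are outside what this thread covers.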
