Commit 692ba36 by sanchit-gandhi
1 Parent(s): b5ad047

Update README.md

Files changed (1):
  1. README.md +7 -5

README.md CHANGED
@@ -351,8 +351,8 @@ This code snippet shows how to evaluate Whisper Medium on [LibriSpeech test-clea
 The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking
 algorithm, it can be used to transcribe audio samples of up to arbitrary length. This is possible through Transformers
 [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
-method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. It can also be extended to
-predict utterance level timestamps by passing `return_timestamps=True`:
+method. Chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline
+can be run with batched inference. It can also be extended to predict sequence level timestamps by passing `return_timestamps=True`:
 
 ```python
 >>> import torch
@@ -363,7 +363,7 @@ predict utterance level timestamps by passing `return_timestamps=True`:
 
 >>> pipe = pipeline(
 >>>   "automatic-speech-recognition",
->>>   model="openai/whisper-medium",
+>>>   model="openai/whisper-large-v2",
 >>>   chunk_length_s=30,
 >>>   device=device,
 >>> )
@@ -371,15 +371,17 @@ predict utterance level timestamps by passing `return_timestamps=True`:
 >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
 >>> sample = ds[0]["audio"]
 
->>> prediction = pipe(sample.copy())["text"]
+>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
 " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."
 
 >>> # we can also return timestamps for the predictions
->>> prediction = pipe(sample, return_timestamps=True)["chunks"]
+>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
 [{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
   'timestamp': (0.0, 5.44)}]
 ```
 
+Refer to the blog post [ASR Chunking](https://huggingface.co/blog/asr-chunking) for more details on the chunking algorithm.
+
 ## Fine-Tuning
 
 The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However,
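
For reference, the snippet as it stands after this commit can be assembled into a self-contained script roughly as follows. The imports and the `device` assignment sit outside the diff context and are assumed from the surrounding README, so treat this as an illustrative sketch of the post-commit example rather than the exact file contents:

```python
# Sketch: the post-commit chunked/batched transcription example as a plain script.
# The import and device lines are assumptions; they are not part of this diff's context.
import torch
from datasets import load_dataset
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,  # enable chunking of long audio into 30s segments
    device=device,
)

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

# chunked transcription; batch_size batches the 30s chunks through the model
prediction = pipe(sample.copy(), batch_size=8)["text"]
print(prediction)
# " Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

# sequence-level timestamps for each transcribed chunk
prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
print(prediction)
# [{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
#   'timestamp': (0.0, 5.44)}]
```

The `sample.copy()` calls mirror the README snippet: the pipeline consumes the audio dict it is given, so copying leaves the original `sample` intact for the second call.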